Word embeddings encode the semantics of words in numerical form. Distributional semantics lets us represent words by their context, in contrast to lexical word vector encodings, which offer no way to compare the similarity of words. Newer algorithms such as BERT and ELMo have proven superior to earlier ones by focusing on concepts and phrases, not just single words. They also produce contextualised word embeddings, which combine word embeddings and distributional semantics to encode concepts that are related at the semantic level. Contextualised word embeddings thus bridge human and machine understanding of language and are essential for solving most Natural Language Processing (NLP) problems.
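The contrast above can be made concrete with a small sketch (illustrative only; the vocabulary and embedding values are made up for demonstration): under a lexical one-hot encoding, every pair of distinct words has cosine similarity 0, so "cat" is no closer to "kitten" than to "car", while dense embedding vectors can express graded similarity.

```python
# Sketch: one-hot encodings vs. dense word embeddings.
# The embedding values below are invented for illustration, not learned.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["cat", "kitten", "car"]
# Lexical encoding: one standard basis vector per word.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Hypothetical 3-dimensional embeddings: "cat" and "kitten" are placed
# near each other, "car" is placed away from both.
embedding = {
    "cat":    np.array([0.90, 0.80, 0.10]),
    "kitten": np.array([0.85, 0.75, 0.20]),
    "car":    np.array([0.10, 0.20, 0.90]),
}

print(cosine(one_hot["cat"], one_hot["kitten"]))      # 0.0 — one-hot sees no relation
print(cosine(embedding["cat"], embedding["kitten"]))  # close to 1: semantically near
print(cosine(embedding["cat"], embedding["car"]))     # much smaller: semantically far
```

Any real embedding algorithm (word2vec, GloVe, BERT) learns such vectors from context rather than assigning them by hand, but the similarity comparison works the same way.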
Aims of this research:
1. inventory, study, and compare algorithms and frameworks for various kinds of word embeddings: multi-sense word embeddings, context embeddings, contextualised concept embeddings, and contextual string embeddings
2. inventory, study, and apply these algorithms to the NLP tasks that benefit from distributional semantics
End goal: to perform concept extraction and named entity recognition (NER) in the domain of probability and statistics and machine learning, using textbooks and papers. The project will then apply these algorithms to extract concepts from the documentation of frameworks and libraries written in a variety of programming languages, such as Scala, Matlab, Python, and R, to make their knowledge accessible.
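As a point of reference for what the embedding-based approach would improve on, concept extraction can be sketched as a simple gazetteer lookup (a hypothetical baseline, not the proposal's method; the term list below is an assumption for illustration):

```python
# Hypothetical baseline: dictionary-based concept extraction for the
# probability/statistics and ML domain. Embedding-based NER, as proposed,
# would generalise beyond such a fixed term list.
import re

GAZETTEER = {
    "gradient descent":     "CONCEPT",
    "markov chain":         "CONCEPT",
    "normal distribution":  "CONCEPT",
    "bayes theorem":        "CONCEPT",
}

def extract_concepts(text):
    """Return (surface form, label, start, end) for each gazetteer match."""
    found = []
    lowered = text.lower()
    for term, label in GAZETTEER.items():
        for m in re.finditer(re.escape(term), lowered):
            found.append((text[m.start():m.end()], label, m.start(), m.end()))
    return sorted(found, key=lambda t: t[2])

sample = "Gradient descent is used to fit models; the Normal distribution appears often."
for surface, label, start, end in extract_concepts(sample):
    print(f"{surface!r} -> {label} [{start}:{end}]")
```

The weakness this exposes is exactly what distributional semantics addresses: a gazetteer cannot recognise unseen surface forms or disambiguate terms by context, whereas contextualised embeddings can.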
University of New England
An international citizen who lived and studied in Canada for most of her life before returning to her native Europe, Ana-Maria is a diligent student who thrives in well-defined scientific domains.
During high school, she discovered a passion for mathematics. Her interest blossomed while verifying trigonometric identities and while studying volume integrals and related rates in her calculus year. She concluded high school with a cumulative GPA of 3.92, AP Scholar Awards in Calculus BC, Statistics, and Computer Science in Java, and honours in Spanish, French, and English literature analysis.

On the programming side, Ana-Maria manages numerous GitHub repositories. She initiated projects to teach herself the functional programming languages Scala and Haskell, using the Specs2 unit-testing framework to specify functionality. While developing a linear algebra library, she expressed theoretical concepts such as basis and independence in Scala, which she believed were missing from existing linear algebra packages. Prior to university, she explored time series and econometrics concepts in the statistical package R. Additionally, she used Wolfram's Mathematica to document the inter-relationships between probability distributions in a ProbOnto-inspired style. After completing her master's degree, she plans to publish this code project as a wiki-book.

Ana-Maria maintained grades above ninety percent during her Bachelor of Mathematics and Statistics at the University of New England in Australia. She plans to leverage her study of topology, complex variables, and abstract algebra to pave the way for category theory in data science. Simultaneously, she delved into applied statistics, such as generalised linear models, and algorithms such as Hamiltonian MCMC and error backpropagation. Combining programming experience in Java, Scala, Haskell, and Python, she will study machine learning by categorising natural language processing algorithms during this research scholarship. Looking ahead, Ana-Maria envisions her own NLP business to glean text semantics for global clients and for financial analysis of currency-exchange markets.