Vast amounts of knowledge are trapped in presentation media such as videos, HTML, PDFs, and paper, rather than being concept-mapped, interlinked, addressable, and reusable at a fine-grained level. This hinders knowledge exchange between humans, and between human cognition and AI-based systems. Concept mapping is known to enhance human cognition. In domain-specific areas of knowledge especially, better interlinking would be achieved if concepts were extracted using surrounding context, accounting for polysemy and key phrases: “You shall know a word by the company it keeps” (Firth, 1957).
Project Aim: I seek to understand models that create good language representations using lexical and semantic structure at the entity and phrase level.
Word embeddings are unsupervised models that capture semantic and syntactic information about words in a compact low-dimensional vector representation. These learned representations aid reasoning about word usage and meaning (Melamud et al. 2016, p. 1).
Classic or static word vectors assign one vector per word, regardless of context. Skip-Gram (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) produce these “context-free representations,” collapsing all senses of a polysemous word into a single vector (Ethayarajh, 2019), which can significantly reduce model performance. For instance, the word “plant” would have an embedding that is the “average of its different contextual semantics relating to biology, placement, manufacturing, and power generation” (Neelakantan et al., 2015). Bidirectional language models (biLMs), by contrast, capture forward and backward history as a contextual word embedding (CWE), which assigns a different vector to a word depending on its context, effectively capturing that word’s polysemy. Static embeddings are in effect lookup tables over word types, whereas contextual embeddings represent each word token in its context.
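The lookup-table behaviour of static embeddings can be illustrated with a minimal sketch. The vectors below are random placeholders, not a trained model; the point is only that a static table returns the identical vector for “plant” in a biology context and a manufacturing context.

```python
import numpy as np

# Toy illustration (random vectors, NOT a trained embedding model):
# a static embedding is a fixed lookup table over word types, so
# "plant" receives the same vector in every context.
rng = np.random.default_rng(0)
static_table = {w: rng.normal(size=4) for w in ["the", "plant", "grew", "closed"]}

def static_embed(tokens):
    return [static_table[t] for t in tokens]

bio = static_embed(["the", "plant", "grew"])      # botanical sense
factory = static_embed(["the", "plant", "closed"])  # industrial sense

# The "plant" vector is identical regardless of sense.
same = np.allclose(bio[1], factory[1])
print(same)  # True
```

A contextual model would instead compute the vector for “plant” from the whole sentence, so the two occurrences above would receive different representations.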
Previous contextualisation models, ULMFiT and ELMo, use a biLM to capture left and right context, though not at the same time, which prevents information from all sentence positions being used simultaneously. Instead of ELMo’s “shallow” concatenation of “independently-trained” biLMs, the BERT (Bidirectional Encoder Representations from Transformers) model jointly conditions on left and right context in all layers, resulting in deeper bidirectional context (Devlin et al., 2019). BERT learns dependency syntax: its attention heads detect “direct objects of verbs, determiners of nouns, objects of prepositions, and objects of possessive pronouns with > 75% accuracy” (Clark et al., 2019). Other attention heads perform coreference resolution (CR), a more challenging NLP task since coreference links span longer distances than syntactic dependencies. That BERT learns these structures via self-supervision may explain its success.
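BERT’s self-supervised objective is masked language modelling: roughly 15% of input positions are selected, and of those 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. A minimal sketch of that masking procedure (a simplification of the published recipe, not BERT’s actual implementation):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Sketch of BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    The model is trained to reconstruct the originals at `targets`."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # original token the model must predict
            roll = rng.random()
            if roll < 0.8:
                out[i] = "[MASK]"
            elif roll < 0.9:
                out[i] = rng.choice(vocab)
            # else: token kept as-is (but still predicted)
    return out, targets

sentence = "the plant near the river closed last year".split()
masked, targets = mask_tokens(sentence, vocab=sentence, seed=3)
```

Because some selected tokens are left unchanged or randomly replaced, the model cannot rely on [MASK] always marking a prediction site, which partially mitigates the pretrain/fine-tune mismatch discussed below.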
Since vanilla Transformers process fixed-length inputs, they cannot carry context across segment boundaries. Dai et al. (2019) created Transformer-XL (extra long) to learn longer-spanning dependencies without “disrupting temporal coherence.” Instead of breaking text into arbitrary fixed-length segments, Transformer-XL respects natural language boundaries such as sentences and paragraphs, helping it build richer context over these and even longer texts such as documents. Its novel architecture comprises a segment-level recurrence mechanism and relative positional encodings.
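The segment-level recurrence idea can be sketched as follows: the hidden states computed for the previous segment are cached and prepended to the current segment’s attention context, so information flows across segment boundaries. This toy uses single-head dot-product attention on random vectors with no learned projections or relative positional encodings; it illustrates only the memory mechanism, not the full model.

```python
import numpy as np

def attend_with_memory(segment, memory, d=4):
    """Sketch of Transformer-XL segment-level recurrence: the current
    segment attends over [memory ; segment], where `memory` holds the
    cached hidden states of the previous segment."""
    context = segment if memory is None else np.concatenate([memory, segment], axis=0)
    # Single-head scaled dot-product attention (no learned weights, for brevity).
    scores = segment @ context.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
seg1, seg2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
h1 = attend_with_memory(seg1, memory=None)
h2 = attend_with_memory(seg2, memory=h1)  # seg2 sees seg1's states via the cache
```

In the actual model the cached states are held fixed (no gradients flow into them), which keeps training cost bounded while extending the effective context length.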
XLNet addresses two flaws in BERT: (1) False Independence Assumption: BERT factorises its log-likelihood as if all masked tokens were reconstructed independently of one another, so BERT ignores dependencies between the masked positions; and (2) Data Corruption: [MASK] tokens never appear in real data during fine-tuning, so their use in pre-training creates a discrepancy between the two stages. XLNet gains the benefits of both autoencoding and autoregressive language modelling while avoiding their issues. As an autoregressive model, XLNet factorises a sequence’s probability with the chain rule of probability, avoiding BERT’s false independence assumption. Secondly, XLNet’s permutation language model captures bidirectional context, and its two-stream attention renders the predictions target-aware.
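The permutation-language-modelling idea rests on a basic fact: the chain rule factorises a joint distribution exactly under any ordering of the variables, so training over sampled permutations lets each token condition on tokens from both sides. The sketch below verifies this on a hypothetical joint distribution over three binary tokens (the numbers are arbitrary, chosen only to sum to 1).

```python
import itertools

# Toy joint distribution over three binary "tokens" (arbitrary values),
# illustrating the factorisation XLNet applies under a sampled permutation:
# p(x) = prod_t p(x_{z_t} | x_{z_<t}) for ANY order z.
joint = dict(zip(itertools.product([0, 1], repeat=3),
                 [0.10, 0.05, 0.20, 0.15, 0.05, 0.10, 0.05, 0.30]))

def marginal(fixed):
    """p(x_i = v for (i, v) in fixed), marginalising out the rest."""
    return sum(p for bits, p in joint.items()
               if all(bits[i] == v for i, v in fixed.items()))

def factorised_prob(x, order):
    """Chain-rule factorisation along the permutation `order`."""
    prob, seen = 1.0, {}
    for i in order:
        num = marginal({**seen, i: x[i]})
        den = marginal(seen) if seen else 1.0
        prob *= num / den
        seen[i] = x[i]
    return prob

x = (1, 0, 1)
p_fwd = factorised_prob(x, order=[0, 1, 2])   # left-to-right order
p_perm = factorised_prob(x, order=[2, 0, 1])  # a permuted order
# Both factorisations recover the same joint probability p(x) = 0.10.
```

BERT’s masked objective, by contrast, approximates the joint as a product of independent per-mask conditionals, which is exact only when the masked tokens are genuinely independent given the unmasked ones.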
ERNIE 1.0 uses the implicit knowledge in entities to infer the relationships between named entities in sentences. This lets ERNIE 1.0 fill in the blanks on named entities within paragraphs where BERT often cannot predict the correct semantic concept. ERNIE pairs a Transformer encoder with self-attention with knowledge-integration techniques such as entity-level and phrase-level masking, so that the prior knowledge contained in conceptual units like phrases and entities contributes to learning longer semantic dependencies, improving model generalisation and adaptability (Sun et al., 2019a). ERNIE 2.0 is another state-of-the-art model that uses continual multi-task learning, inspired by the persistence of human learning, to “include more lexical, syntactic and semantic information from training corpora in form of named entities (like person names, location names, and organisation names), semantic closeness (proximity of sentences), sentence order or discourse relations” (Sun et al., 2019b).
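The contrast with BERT’s token-level masking can be sketched simply: entity-level masking hides a whole multi-token unit at once, so the model must predict it from the surrounding context rather than from the unit’s own remaining tokens. The example sentence and span indices below are illustrative choices, not taken from ERNIE’s training data.

```python
def entity_level_mask(tokens, entity_spans):
    """Sketch of ERNIE-style knowledge masking: mask whole entity or
    phrase spans rather than independent tokens, forcing the model to
    predict the conceptual unit from its surrounding context."""
    out = list(tokens)
    for start, end in entity_spans:  # spans are half-open [start, end) indices
        for i in range(start, end):
            out[i] = "[MASK]"
    return out

tokens = ["Harry", "Potter", "is", "a", "series", "by", "J.", "K.", "Rowling"]
masked = entity_level_mask(tokens, entity_spans=[(0, 2), (6, 9)])
# → ['[MASK]', '[MASK]', 'is', 'a', 'series', 'by', '[MASK]', '[MASK]', '[MASK]']
```

With token-level masking, a model seeing “Harry [MASK]” can guess “Potter” from local co-occurrence alone; masking the full entity span forces it to use the relationship between the two entities instead.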
University of New England