Moving Through Word Embedding Space

For me, mathematics provides a gateway to understanding the world and changing it for the better. With the massive breakthroughs we have seen in AI over the last few decades, the need to understand mathematics is only growing. This idea, along with my curiosity about how things work, sparked my interest in mathematics and data science.

For my research I decided to focus on Natural Language Processing (NLP), the branch of computer science concerned with bridging the gap between human and computer language. The work done so far in NLP has been instrumental in shaping the world we have today. Most of us use the fruits of NLP every day through assistants such as Siri and Google Assistant and, more recently, ChatGPT.

Word2Vec is an algorithm used to translate large amounts of text (books, news articles and technical reports) into a form that machines can understand. It does this by converting each word it is taught into a series of numbers called a vector. As a very simple example, Word2Vec could take the word “Apple” and convert it to [2,4,1,3,2]. While it may be hard for humans to understand a word in this form, computers find it much easier to work with.

How does the computer know what numbers to pick? This is where the magic of Word2Vec comes in. Given lots of text, the computer can learn the relationships between words, and it picks the numbers that represent each word based on the words that tend to appear around it. For example, the words ‘coffee’ and ‘tea’ might often appear with the words ‘cup’ and ‘drink’. The computer can learn that ‘coffee’ and ‘tea’ are related, and gives them similar vectors. The process of converting a word to a vector is called word embedding. Word embedding allows us to do things with words that we normally can’t, like add them together or multiply them. This allows us to do very interesting things, for example:

[King] – [Man] + [Woman] ≈ [Queen]
[Puppy] – [Dog] + [Cat] ≈ [Kitten]
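
To make the arithmetic above concrete, here is a minimal sketch in Python. The three-dimensional vectors are made up by hand purely for illustration (real Word2Vec vectors are learned from text and have hundreds of dimensions), and the names toy_vectors and analogy are hypothetical helpers, not part of any Word2Vec library.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# The dimensions loosely stand for: [royalty, maleness, femaleness].
toy_vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(start, minus, plus, vectors):
    """Return the word whose vector is closest to start - minus + plus."""
    target = [
        s - m + p
        for s, m, p in zip(vectors[start], vectors[minus], vectors[plus])
    ]
    # Exclude the three input words so the answer is a new word.
    candidates = [w for w in vectors if w not in {start, minus, plus}]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With these hand-picked numbers the nearest remaining word is "queen". On real embeddings the result is only approximately equal to the target vector, which is why the answer is found by searching for the closest word rather than an exact match.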

Word embeddings make it much easier for machines to process text along with lots of other information. Word2Vec has already been used to discover new chemical compounds and build recommendation systems. I wanted to help improve the algorithm by examining different techniques that can be used to measure the similarity between words. To do this, I used a pre-trained Word2Vec model provided by Google and examined methods of measuring the distance between two vectors. The first technique I used was cosine similarity, which examines the angle between two vectors. The other technique I used was Minkowski distance, which can measure a number of different distances with a single formula, such as Euclidean (a straight line between two points) and Manhattan (a path that follows a grid-like pattern). My results indicated that cosine similarity was the best technique for measuring similarity between words.
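
The two measures can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the research: the vectors u and v are arbitrary two-dimensional examples, not Word2Vec embeddings, and the function names are my own. The Minkowski distance with parameter p is the p-th root of the sum of |x - y|^p over all coordinates; p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def minkowski_distance(a, b, p):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# Arbitrary example vectors (assumed for illustration).
u = [1.0, 2.0]
v = [4.0, 6.0]

manhattan = minkowski_distance(u, v, 1)   # |1-4| + |2-6| = 7
euclidean = minkowski_distance(u, v, 2)   # sqrt(3^2 + 4^2) = 5
angle_based = cosine_similarity(u, v)

print(manhattan, euclidean, angle_based)

# Scaling a vector changes its distance from u but not its direction,
# so the cosine similarity with u stays at (approximately) 1.
scaled = [2.0, 4.0]
print(minkowski_distance(u, scaled, 2), cosine_similarity(u, scaled))
```

The last two lines show one difference between the measures: cosine similarity depends only on the direction of the vectors, while the Minkowski distances also depend on their lengths.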

Gabriel Schussler
Western Sydney University