Understanding human language is one of the many tasks that computers are now being taught to do automatically. It’s funny, because only a few decades ago programmers had to code in assembly languages, essentially talking directly to computers on their own terms, and now look how the tables have turned!
The big tech companies have been doing this for quite some time now. Siri, Cortana and Alexa are just a few examples. They recognise your speech, convert it into text and then make sense of that text. Similarly, when you Google a question, the search engine is able to understand the question and give you (more or less) accurate answers directly, or at least point you in the right direction, because it can understand the text in millions of web pages. How do they do it?
My project addressed a similar question, but on a smaller scale: how can we detect racism in Australian social networks? Instead of analysing web pages, I analysed tweets using a machine learning algorithm called Word2vec. As the name suggests, it converts words into vectors of real numbers. But how does one even start putting a value on a word? By its length, part of speech or frequency? Word2vec starts by finding the meaning of the word, not from a dictionary but from the other words that surround it in a sentence. In other ‘words’, the algorithm looks at the context of the word and seeks meaning in it. Consider these two sentences:
The quick brown fox jumps.
The quick red fox jumps.
Here ‘red’ and ‘brown’ are both surrounded by the same words, which suggests to the algorithm that ‘red’ and ‘brown’ are similar in some way. The model can learn this in one of two ways. It can either look at the words surrounding a target word and guess which word fits in the gap, or look at the word itself and guess which other words are likely to surround it. The former method is known as CBOW (continuous bag-of-words) and the latter as Skip-gram.
Either way, after going through millions of examples (in my case, tweets), my model was able to answer questions such as ‘What’s the closest word to Vegemite?’ (Nutella, with 87% similarity) and ‘Which is the odd one out amongst Sydney, Melbourne, Auckland and Brisbane?’ (Auckland, it says). It was quite fascinating when I first tried these out, because the model not only made sense of words but also gave answers grounded in the Australian context. Later in the project, we looked into visualising these word vectors by plotting them on a graph and seeing where the racist words lie. In this way we hope to classify whether and where racist sentiment is prevalent in our model, so that we can better detect racism in Australian social networks.
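To make the two training setups and the similarity questions above a little more concrete, here is a minimal sketch in plain Python. It is not the code used in the project: it only shows how CBOW and Skip-gram would slice the two fox sentences into training examples, and how an ‘odd one out’ question can be answered once words have vectors. The helper names, the window size of 2, the averaging rule used to pick the odd word and the two-dimensional city vectors are all assumptions made up for illustration; real Word2vec vectors are learned from data and typically have far more dimensions.

```python
import math


def tokenize(sentence):
    """Lowercase a sentence, drop full stops and split on whitespace."""
    return sentence.lower().replace(".", "").split()


def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW view: (surrounding words, word to guess) for every position."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram view: (word, one surrounding word to guess) pairs."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_of(tokens, i, window)]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(words, vectors):
    """Pick the word whose average similarity to the others is lowest.

    This averaging rule is an illustrative assumption, not necessarily
    what any particular Word2vec library does.
    """
    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)


brown = tokenize("The quick brown fox jumps.")
red = tokenize("The quick red fox jumps.")

# 'brown' and 'red' sit in exactly the same context.
print(cbow_examples(brown)[2])  # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(cbow_examples(red)[2])    # (['the', 'quick', 'fox', 'jumps'], 'red')
print(skipgram_examples(brown)[5:9])  # the four pairs that start from 'brown'

# Made-up 2-D vectors, purely for illustration (not learned from tweets).
toy_vectors = {
    "sydney": [0.90, 0.10],
    "melbourne": [0.80, 0.20],
    "brisbane": [0.85, 0.15],
    "auckland": [0.20, 0.90],
}
print(odd_one_out(list(toy_vectors), toy_vectors))  # auckland
```

The point of the first two printed lines is that ‘brown’ and ‘red’ produce identical contexts, which is exactly the signal that pushes their vectors towards each other during training. The last line works only because the toy vector for Auckland was deliberately chosen to point in a different direction from the three Australian cities.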
Sajit Gurubacharya was a recipient of a 2018/19 AMSI Vacation Research Scholarship.