Abstract

Having studied both mathematics and history, I have found that the two subjects are treated as worlds apart. Few historians seem to consider mathematics useful to the study of history, and mathematicians likewise rarely see a place for themselves within it. But what if mathematicians could apply more sophisticated statistics to help historians investigate our past more efficiently?

Using Mathematics to Untangle World War One Documents

My project looks for a way to numerically compare World War One documents and determine which are more strongly related than others. By treating each document as a probability distribution over its word frequencies, methods from information theory can be applied. The Jensen-Shannon Divergence (JSD) is a symmetric, normalised measure of how far apart two probability distributions are. By computing the JSD between diaries and identifying the pairs with a significantly low value, I was able to build a network of the diaries. The nodes of this network are the diaries, and an edge between two nodes means that their JSD is significantly low.
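A minimal sketch of this step might look as follows. It is an illustration only, not the project's actual code: the function names are hypothetical, tokenisation is a plain whitespace split, and a fixed `threshold` stands in for whatever test was used to decide that a JSD is "significantly low". With base-2 logarithms the JSD lies between 0 (identical distributions) and 1 (no words in common).

```python
import math
from collections import Counter
from itertools import combinations


def word_distribution(text):
    """Turn a document into a probability distribution over its words."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits.

    Assumes m[word] > 0 wherever p[word] > 0, which always holds
    when m is the mixture used in the JSD below.
    """
    return sum(pw * math.log2(pw / m[word]) for word, pw in p.items() if pw > 0)


def jensen_shannon_divergence(p, q):
    """JSD between two word distributions given as dicts word -> probability."""
    vocabulary = set(p) | set(q)
    # The mixture distribution is the pointwise average of p and q.
    mixture = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocabulary}
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def build_edges(documents, threshold):
    """Return pairs of document names whose JSD falls below the threshold.

    `threshold` is a hypothetical fixed cut-off used here for illustration.
    """
    distributions = {name: word_distribution(text) for name, text in documents.items()}
    edges = []
    for a, b in combinations(sorted(distributions), 2):
        if jensen_shannon_divergence(distributions[a], distributions[b]) < threshold:
            edges.append((a, b))
    return edges
```

Calling `build_edges` on a dictionary mapping diary names to their text gives the edge list of the network, which can then be handed to a graph library.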

With the network in place, the graph could be passed to the fast-greedy community detection algorithm to identify dense subgraphs. From there I was able to explore what brought these communities together. To do this I looked at the most important words in each community using TF-IDF (term frequency-inverse document frequency). TF-IDF finds the words that are important to one community but not to every community. It does this by scaling the term frequency by the inverse document frequency, which is larger when the word appears in fewer communities and zero when it appears in all of them. I also had to read through the diaries to confirm my inferences, and this is where ‘traditional’ historical research was needed.
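The TF-IDF scoring described above can be sketched roughly as below. This is an assumption-laden illustration rather than the project's code: the function name `tf_idf` is hypothetical, each community is represented simply as a list of its words, and the inverse document frequency is taken to be log(N / df), one common form that matches the description (zero when a word appears in all N communities, larger when it appears in fewer).

```python
import math
from collections import Counter


def tf_idf(communities):
    """Score every word in every community.

    `communities` maps a community name to the list of words in its diaries.
    Returns a dict: community name -> {word: TF-IDF score}.
    """
    n_communities = len(communities)

    # Document frequency: in how many communities does each word occur?
    doc_freq = Counter()
    for tokens in communities.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in communities.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            # Term frequency scaled by the inverse document frequency.
            word: (count / total) * math.log(n_communities / doc_freq[word])
            for word, count in counts.items()
        }
    return scores
```

Sorting each community's words by score, highest first, gives a shortlist of candidate themes to check against the diaries themselves.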

Every community that was found had some form of underlying connection, but the most striking was a community of just two diaries. The first diary was written by a man whose ship was captured by a German raider and who was taken captive aboard it. After some time, the raider captured another ship. Aboard that ship was the man who wrote the second diary, and he too was taken captive.

Using statistical methods to sort through a collection of around 900 documents meant that I was able to find these closely related diaries without having to read every single one. While this approach cannot replace historical research as it is currently practised, I hope it can streamline the process and uncover connections that might never have been made otherwise.

Christina Tait