The amount of data in the world is growing exponentially. Can we use machine learning to find new information in these huge new data sets? Like humans, machines can see patterns that mean nothing useful. How do machines make these mistakes, and how do we avoid them?
All this data may be useful to human society, but one challenge in making use of it is its sheer volume. Exponentially growing data cannot be comprehended by the human mind alone, so the question is whether machine learning can find new information in these huge data sets. It seems that it can, but we must be aware of the shortcomings of machine learning as we approach this problem.
Humans are very good at seeing faces in visual data where no faces exist. Machines, likewise, readily find patterns that don’t mean anything useful.
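As a small illustration of a machine finding a pattern that is not there, the sketch below asks K-Means, a common clustering algorithm, for three clusters in points drawn uniformly at random. The data, the choice of three clusters, and the use of scikit-learn are all assumptions made for this example; they are not taken from the study.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumption: 300 points scattered uniformly over the unit square,
# so there is no real group structure to discover.
rng = np.random.default_rng(0)
noise = rng.uniform(0.0, 1.0, size=(300, 2))

# K-Means is told to find three clusters, and it will return three.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(noise)

# Count how many points were placed in each "cluster".
sizes = np.bincount(labels, minlength=3)
print("Points per cluster:", sizes)
```

The algorithm reports three populated clusters even though the data is structureless. Nothing in its output warns the user that the grouping is meaningless.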
Here we look at three algorithms applied to the Iris flower data set: K-Means, DBSCAN, and Gaussian Mixture Models (GMM) fitted with Expectation Maximisation (EM). Each of these algorithms groups the data points into separate partitions. We call this process clustering, and the partitions are called clusters. Of the three, we find that GMM with EM works best on this data set.
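A comparison of this kind can be sketched with scikit-learn, which ships the Iris data set and all three algorithms. This is a minimal sketch and not the code used in the study: the parameter values (three clusters, `eps=0.5`, `min_samples=5`) and the use of the adjusted Rand index to score agreement with the known species are assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# 150 flowers, 4 measurements each; species is used only for scoring.
X, species = load_iris(return_X_y=True)

# Assumed settings: three clusters for K-Means and GMM,
# and hand-picked density parameters for DBSCAN.
cluster_labels = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5).fit_predict(X),
    "GMM with EM": GaussianMixture(
        n_components=3, covariance_type="full", random_state=0
    ).fit_predict(X),
}

# Adjusted Rand index: 1.0 means the clusters match the species exactly,
# values near 0 mean the agreement is no better than chance.
scores = {
    name: adjusted_rand_score(species, labels)
    for name, labels in cluster_labels.items()
}
for name, score in scores.items():
    print(f"{name}: adjusted Rand index = {score:.2f}")
```

The scores depend on the parameters chosen, particularly for DBSCAN, so a different choice of `eps` or `min_samples` may change the ranking.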
You might then ask, why not always use GMM with EM? That is a good question. The answer is that every data set is a little different. Sometimes the clusters we are looking for cannot be found by K-Means, because K-Means can only find clusters of a particular shape: roughly circular clusters built around a centre. To find clusters of other shapes, an algorithm called DBSCAN is useful. It finds clusters that are relatively continuous, regardless of their shape, by looking at density. The user tells DBSCAN what ‘dense’ looks like (typically a neighbourhood radius and a minimum number of points within it), and DBSCAN then identifies all continuously dense areas.
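The difference in cluster shape can be seen on a synthetic data set of two interleaved crescents, where neither cluster is circular. The sketch below assumes scikit-learn's `make_moons` generator and hand-picked DBSCAN settings (`eps=0.2`, `min_samples=5`); none of these come from the study.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped groups that curl around each other.
X, true_labels = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means splits the plane around two centres.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: a point is 'dense' if at least min_samples points
# lie within distance eps of it (assumed values).
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Score each result against the known crescents.
kmeans_ari = adjusted_rand_score(true_labels, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_labels, dbscan_labels)
print(f"K-Means adjusted Rand index: {kmeans_ari:.2f}")
print(f"DBSCAN adjusted Rand index:  {dbscan_ari:.2f}")
```

K-Means cuts across both crescents because it can only separate the points around two centres, while DBSCAN follows each crescent along its dense, continuous path.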
Interestingly, GMM with EM is effectively a more general version of K-Means. So why have K-Means at all? One answer is that K-Means runs faster than GMM with EM.
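One way to see the relationship is to compare the assignment step of the two algorithms. The sketch below assumes a restricted GMM in which every component has equal weight and the same spherical variance; the function names and the three sample points are hypothetical, chosen only for illustration. Under those assumptions the soft assignment of EM approaches the hard, nearest-centre assignment of K-Means as the variance shrinks.

```python
import numpy as np

def squared_distances(X, centres):
    """Squared Euclidean distance from every point to every centre."""
    return ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)

def soft_assign(X, centres, variance):
    """EM assignment step for a GMM with equal weights and one shared
    spherical variance: each point gets a probability for each cluster."""
    log_r = -squared_distances(X, centres) / (2.0 * variance)
    log_r -= log_r.max(axis=1, keepdims=True)  # avoid underflow in exp
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def hard_assign(X, centres):
    """K-Means assignment step: each point goes to its nearest centre."""
    return squared_distances(X, centres).argmin(axis=1)

# Hypothetical data: two centres and a point lying between them.
centres = np.array([[0.0, 0.0], [1.0, 0.0]])
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.4, 0.0]])

for variance in (1.0, 0.1, 0.001):
    print(f"variance = {variance}")
    print(soft_assign(X, centres, variance).round(3))

print("K-Means assignment:", hard_assign(X, centres))
```

With a large variance the in-between point is shared almost evenly between the two clusters. With a tiny variance it belongs almost entirely to the nearer one, which is the K-Means answer. A full GMM also fits the weights and covariances of each component, and that extra work is one reason it runs more slowly.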
The take-home message is that clustering data with machine learning requires that the user understand something about the data that she is looking at.
Alex Oakley
University of Tasmania
