A brief introduction to my AMSI research. We discuss the motivation for the project, as well as the overarching aim of the research. The need for artificial test data generation is placed in the context of the current technological environment, and our approaches to creating very specific types of classification data are introduced.
Recent decades have seen major advances in the development and application of machine learning (ML) algorithms. Objective performance evaluation of ML techniques is paramount in justifying their real-world use. Worryingly, a number of recent papers have highlighted potential inadequacies in the currently accepted methodology for testing classification algorithms.
In this research project we focus on artificially generating new test datasets (test instances) that may contribute to the diversity of existing data repositories and strengthen a “test instance space”: a 2D projection of high-dimensional feature vectors, each encoding certain qualities of a dataset. The instance space allows algorithm performance to be visualised across a span of varying test instance properties, so that pockets of the space corresponding to algorithm strengths and weaknesses can be easily identified. The lack of variety in existing repositories prevents the generation of a test instance space capable of finely separating algorithms by performance. For example, fundamentally different algorithms such as k-nearest neighbours (KNN), support vector machines (SVM) and random forests (RF) perform similarly across most current instances. Further, there are regions of the space in which instances are scarce.
We explore data generation as a multi-objective optimisation problem, which is solved using genetic algorithms. A genetic algorithm (GA) is a metaheuristic inspired by the processes of natural selection that seeks optimal values for an objective function. In a GA, a population of candidate solutions is adjusted over many generations (iterations) so that it evolves toward solutions with higher fitness (better solutions to the objective function). As in natural selection, ‘fitter’ individuals are carried over to the next generation. In addition, mutation and crossover operators re-establish population diversity in future generations. Using a GA, we create classification datasets that are particularly difficult or easy for certain classifiers.
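The GA loop described above can be illustrated with a minimal sketch. It is not the implementation used in this project: it optimises a single objective rather than several, and every name and parameter in it (evolve_hard_dataset, loo_1nn_error, the population size, the mutation rate, the choice of a 1-nearest-neighbour classifier) is an assumption made for illustration. Each individual is a small 2D dataset with fixed class labels, and fitness is the leave-one-out error of the 1-nearest-neighbour classifier, so higher fitness means a dataset that is harder for that classifier.

```python
import random


def loo_1nn_error(points, labels):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for i, (xi, yi) in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            d = (xi - xj) ** 2 + (yi - yj) ** 2
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] != labels[i]:
            errors += 1
    return errors / len(points)


def evolve_hard_dataset(n_points=20, pop_size=30, generations=40,
                        mutation_rate=0.2, n_elite=2, seed=0):
    """Evolve a 2D dataset that is hard for 1-NN (illustrative sketch only)."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_points)]  # fixed, balanced class labels

    def random_individual():
        return [(rng.random(), rng.random()) for _ in range(n_points)]

    def fitness(individual):
        return loo_1nn_error(individual, labels)

    def tournament(scored):
        # Binary tournament: the fitter of two random individuals is a parent.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    def crossover(p1, p2):
        # Uniform crossover: each point is inherited from either parent.
        return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]

    def clip(v):
        return min(1.0, max(0.0, v))

    def mutate(individual):
        # Gaussian jitter on a random subset of points, kept in the unit square.
        out = []
        for x, y in individual:
            if rng.random() < mutation_rate:
                x, y = clip(x + rng.gauss(0, 0.1)), clip(y + rng.gauss(0, 0.1))
            out.append((x, y))
        return out

    population = [random_individual() for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda t: t[0], reverse=True)
        history.append(scored[0][0])
        # Elitism: the fittest individuals are carried over unchanged.
        next_population = [ind for _, ind in scored[:n_elite]]
        while len(next_population) < pop_size:
            child = crossover(tournament(scored), tournament(scored))
            next_population.append(mutate(child))
        population = next_population

    best = max(population, key=fitness)
    return best, labels, history


if __name__ == "__main__":
    best, labels, history = evolve_hard_dataset()
    print("best 1-NN leave-one-out error per generation:", history)
```

Because the elite individuals are copied unchanged, the best fitness recorded in history cannot decrease from one generation to the next. Negating the fitness would instead search for datasets that are easy for the same classifier, and a multi-objective variant would score each dataset against several classifiers at once.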
The report discusses an approach for creating a large cohort of discriminating test instances. Creating such a set and projecting it onto an instance space will allow algorithm performance to be understood in greater depth than existing repositories currently allow. For example, where in an instance space do datasets optimised for certain classification strategies tend to lie? What does that imply about the properties of such datasets, and about the behaviour of other classifiers in this region? Building a more complete instance space in this way allows a more refined analysis of the strengths and weaknesses of new and emerging ML algorithms, and so provides a valuable tool for stress-testing them.
Kulunu Dharmakeerthi
The University of Melbourne
