My summer research consisted of comparing the performance of different statistical methods for sparse factor models. Sparse factor models are capable of finding meaningful, hidden variables which govern the data behind the scenes. Applying this to gene expression data enables us to learn more about the biological processes that regulate our genes.
For my AMSI Vacation Research Scholarship, I compared the performance of different statistical methods for fitting sparse factor models, a dimension reduction technique. Nowadays, large datasets are available everywhere, including in computational biology. A common topic in this field is the analysis of gene expression. All cells of the same organism have the same genes, but they each have different traits. This is because genes can be amplified or suppressed differently across cells, often because of some biological process behind the scenes. The way that genes are expressed gives us the opportunity to learn more about these biological processes, which may help us to combat diseases more effectively.
Using sparse factor models on gene expression data can enable learning about these biological processes. Imagine a large matrix with numerous rows of gene expression measured for different individuals or cells. Sparse factor models attempt to summarise this data by a product of two much smaller matrices – one for linking the genes with hidden variables, another for weighting these hidden variables for each individual or cell. These smaller matrices have the potential to resemble real biological processes, provided that some of the links between genes and hidden variables are actually zero. This is because each biological process should only control a few genes, not all the possible genes. Sparse factor models make this possible, as the linking matrix may be sparse.
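To make the factorisation concrete, here is a minimal sketch in Python with NumPy that simulates data of the kind a sparse factor model describes. It is only an illustration, not the code used in the project: the matrix sizes, the 10% chance of a non-zero link, the noise level, the choice of genes as rows, and the names loadings, factors and expression are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: many genes, few hidden variables, a modest number of cells.
n_genes, n_factors, n_cells = 200, 5, 50

# Linking matrix (genes x hidden variables). Each link is non-zero with
# probability 0.1 (an assumed value), so most entries are exactly zero.
is_linked = rng.random((n_genes, n_factors)) < 0.1
loadings = np.where(is_linked, rng.normal(size=(n_genes, n_factors)), 0.0)

# Weighting matrix (hidden variables x cells): how active each hidden
# variable is in each cell.
factors = rng.normal(size=(n_factors, n_cells))

# Observed gene expression: product of the two small matrices plus noise.
noise = 0.1 * rng.normal(size=(n_genes, n_cells))
expression = loadings @ factors + noise

# Fraction of links that are exactly zero in the linking matrix.
sparsity = np.mean(loadings == 0)
print(f"expression matrix shape: {expression.shape}")
print(f"fraction of zero links: {sparsity:.2f}")
```

In this sketch the 200 x 50 expression matrix is summarised by a 200 x 5 matrix and a 5 x 50 matrix. Inference runs in the opposite direction: only the expression matrix is observed, and the two smaller matrices have to be estimated from it.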
The question now is, given the gene expression data, what do the smaller matrices look like? This is a problem of Bayesian inference. Many inference techniques are available, and my research was to compare the accuracy and computational speed of two techniques: the traditional ‘Markov chain Monte Carlo’ (MCMC), and the more recent ‘variational inference’ (VI). MCMC is a sampling technique, which has a theoretical guarantee of being accurate, yet in practice takes a long time to run. This is because it has to explore the relevant space of possible matrices sufficiently to give accurate results. On the other hand, VI is an optimisation technique that finds a simple distribution to describe the possible matrices, which runs more quickly, but at the cost of losing some accuracy.
After implementing both techniques and comparing them using numerical simulations, I found that the difference in accuracy between the two was small, and so the faster approach of variational inference is preferable. I look forward to applying this statistical technique to analyse more biological datasets, and hopefully finding some interesting results.
Yong See Foo
The University of Melbourne
