Wheat is the world’s most prevalent food source. An improved understanding of how molecular parameters affect the physical characteristics of wheat can deliver tremendous rewards in developing better crops. I analyse two large data sets that fall into an increasingly relevant category: high-dimension, low-sample-size data. Through multivariate statistical techniques, the most influential molecular factors for a particular physical characteristic can be identified.
Wheat is the world’s most prevalent food source: according to the UN Food and Agriculture Organization, it accounts for 18% of the calories and 20% of the protein consumed by humans, and it provides a number of crucial vitamins and minerals.
Linking properties of wheat to its genome allows wheat breeders to develop crops with improved disease resistance, growth, and other attributes. There is already a large body of work on evaluating crop resistance to pests and environmental conditions and on improving growth.
However, another aspect worth improving is the quality of wheat flour. Extensibility—the stretching property of dough—is of key interest owing to its importance in bread and pasta. An improved understanding of what influences extensibility can deliver tremendous rewards in developing better crops.
I had access to two data sets, which shared about seventy variables corresponding to wheat varieties. The first data set has around ninety variables, primarily measurements of the physical properties of wheat flour. The second data set provides molecular information, with 13,000 variables describing the proteins in the wheat.
Molecular information, frequently genetic and protein data, typically falls into the category of high-dimension, low-sample-size data, as it does here. This type of data necessitates a different approach, because many classical techniques assume a small number of variables relative to the number of observations, or hold true only as the sample size n goes to infinity.
Multivariate statistics offers techniques suited to such data, two of which I have used in my report. Principal component analysis exploits the dependence between variables in high-dimension data to represent it with fewer variables while losing little meaningful information. I used this to identify the most impactful variables in the ‘physical measurements’ data set, and then used linear modelling to identify the variables most related to extensibility.
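As a rough illustration of this step, the sketch below runs principal component analysis through a singular value decomposition and then regresses a response on the component scores. It is a minimal sketch on synthetic data, not the analysis from the report: the function name `pca`, the array sizes, and the response `y` standing in for extensibility are all assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Return scores, loadings and explained-variance ratios via SVD."""
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # share of variance per component
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T               # columns are component directions
    return scores, loadings, explained[:n_components]

# Synthetic stand-in data (assumption): 70 observations, 90 variables,
# driven by 3 hidden factors plus a little noise.
rng = np.random.default_rng(0)
n, p = 70, 90
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))
y = latent[:, 0] + 0.1 * rng.normal(size=n)      # hypothetical "extensibility"

scores, loadings, explained = pca(X, n_components=3)

# Ordinary least squares of the response on the component scores.
A = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
```

Variables with large absolute entries in `loadings` contribute most to each component, which is one common way to read off the impactful variables. In practice the variables would often be standardised before the decomposition when they are measured in different units.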
From here I used a second multivariate technique, canonical correlation analysis, to extract the connections between the variables in the two data sets. This identified around 200 variables from the protein data (a mere 1.7% of the original set) that have the most influence on the variable of interest, extensibility.
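The idea behind canonical correlation analysis can be sketched as follows: find weighted combinations of the variables in each data set that are as strongly correlated with each other as possible, then rank variables by the size of their weights. This is a minimal sketch on synthetic data, not the procedure used in the report. The function name `cca` and the small ridge term `reg` are assumptions; some form of regularisation or prior dimension reduction is generally needed when the number of variables far exceeds the number of observations.

```python
import numpy as np

def cca(X, Y, reg=1e-3):
    """Regularised CCA: canonical correlations and weight matrices."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariance blocks, with a ridge term to keep them invertible.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky, full_matrices=False)
    return s, Kx @ U, Ky @ Vt.T   # correlations, X weights, Y weights

# Synthetic stand-in data (assumption): one shared signal links
# column 0 of X to column 0 of Y; every other column is noise.
rng = np.random.default_rng(1)
n = 70
shared = rng.normal(size=n)
X = rng.normal(size=(n, 5))
Y = rng.normal(size=(n, 4))
X[:, 0] = shared + 0.1 * rng.normal(size=n)
Y[:, 0] = shared + 0.1 * rng.normal(size=n)

corr, Wx, Wy = cca(X, Y)

# Rank the X variables by absolute weight in the first canonical direction.
ranking = np.argsort(-np.abs(Wx[:, 0]))
```

Here `ranking` lists the columns of `X` from most to least heavily weighted in the first canonical pair. Selecting a short list from thousands of variables, as in the text, would follow the same pattern with a cut-off on the ranked weights.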
University of Western Australia