A DNA profile is valuable for eliminating persons of interest from a forensic case. A profile does not prove that a sample collected from a crime scene contained a person’s DNA; rather, it can exclude that person as a contributor. In forensics, profiles are produced from tissue samples and biological fluids collected from crime scenes or directly from a victim or person of interest. The profile is read from an electropherogram, a graph with a series of peaks, each corresponding to an allele call. Often the sample is degraded, contains low-copy DNA, or is a mixture from several possible contributors, which makes obtaining the profile a challenge.
Currently, only peak height is deemed acceptable for determining a profile. The height of a peak must exceed an analytical threshold for it to be called a true peak, and a stochastic threshold for it to be included in the profile. This is because some peaks in the electropherogram output are stutter or pull-up artefacts. These are still recognised as peaks, but do not contribute to the profile.
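The two-threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the threshold values (in relative fluorescence units) are hypothetical, since real thresholds are set by each laboratory's validation, and the sketch makes no attempt to detect stutter or pull-up.

```python
# Hypothetical thresholds in relative fluorescence units (RFU).
# Real values come from laboratory validation and are not given here.
ANALYTICAL_THRESHOLD = 50.0
STOCHASTIC_THRESHOLD = 200.0


def classify_peak(height_rfu,
                  analytical=ANALYTICAL_THRESHOLD,
                  stochastic=STOCHASTIC_THRESHOLD):
    """Classify a peak by height alone, using the two thresholds."""
    if height_rfu <= analytical:
        # Not distinguishable from baseline noise: not called as a peak.
        return "below_analytical"
    if height_rfu <= stochastic:
        # Called as a peak, but too low to be included in the profile.
        return "below_stochastic"
    # Exceeds both thresholds.
    return "included"


for height in (30.0, 120.0, 500.0):
    print(height, classify_peak(height))
```

Peaks in the middle category are the ones the project set out to recover by using more than height.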
The aim of the project was to use more of the information in the raw data to improve the readability of the electropherograms, incorporating peak area, kurtosis and peak base width alongside height to identify true peaks that fall just below the currently accepted thresholds.
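As a rough illustration of the four measures named above, the sketch below computes them for a single peak from a short window of consecutive scans. The function name and the definitions are assumptions made for this example, not the project's method: area is taken as the sum of baseline-corrected intensities with unit scan spacing, base width as the number of scans spanned above the baseline, and kurtosis as the fourth standardised moment of scan position weighted by intensity.

```python
def peak_features(window, baseline=0.0):
    """Summarise one peak from raw intensities at consecutive scans."""
    # Subtract the baseline and clip negative values to zero.
    y = [max(v - baseline, 0.0) for v in window]
    total = sum(y)
    if total == 0.0:
        raise ValueError("window has no signal above the baseline")

    # Indices of scans that rise above the baseline.
    above = [i for i, v in enumerate(y) if v > 0.0]

    # Intensity-weighted moments of scan position.
    mean = sum(i * v for i, v in enumerate(y)) / total
    var = sum(v * (i - mean) ** 2 for i, v in enumerate(y)) / total
    m4 = sum(v * (i - mean) ** 4 for i, v in enumerate(y)) / total

    return {
        "height": max(y),
        "area": total,  # rectangle rule with unit scan spacing
        "base_width": above[-1] - above[0] + 1,
        "kurtosis": m4 / var ** 2 if var > 0.0 else float("nan"),
    }


features = peak_features([0, 1, 2, 1, 0])
print(features)
```

A real electropherogram would first need the raw signal extracted and the baseline estimated, which is the step that proved difficult in this project.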
Difficulties with the project soon became apparent. Weeks were spent searching for, downloading and testing programs to extract the raw data from the files produced by the Applied Biosystems PCR machine and software, because the file format is unique to the company. The program that appeared most appropriate was too expensive to obtain for this project.
An alternative approach was then taken, with the same aim of extracting more information from the electropherograms. Instead of trying to identify the allele calls, attention was restricted to distinguishing the baseline signal from the peaks, similar to the approach of Taylor et al. (2016). Artificial neural networks (ANNs) were used to classify each scan as either baseline or peak, using data from the PROVEDIt (Project Research Openness for Validation with Empirical Data) project, available from the Laboratory for Forensic Technology Development and Integration database at lftdi.com. Data to train the ANN were extracted from untreated single-profile DNA samples.
In contrast, the test data came from samples that had been exposed to ultraviolet light and therefore likely contained degraded DNA. The results were promising: approximately 99% of the baseline and peak scans in the test data were classified correctly.
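The scan-by-scan classification can be illustrated with a very small neural network written in plain Python. Everything in this sketch is an assumption made for illustration: the traces are synthetic (Gaussian peaks plus noise, not PROVEDIt data), a scan is labelled as peak when the noise-free signal is at least 50 RFU, the input is a three-scan window, and the network has one hidden layer of four units. The test trace uses lower peak heights as a loose stand-in for degraded samples. It is not the architecture or data used in the project.

```python
import math
import random


def synthetic_trace(n_scans, centres, heights, rng,
                    sigma=3.0, noise_sd=5.0, cut=50.0):
    """Return (intensities, labels); label 1 = peak scan, 0 = baseline."""
    signal, labels = [], []
    for i in range(n_scans):
        clean = sum(h * math.exp(-((i - c) ** 2) / (2 * sigma ** 2))
                    for c, h in zip(centres, heights))
        signal.append(clean + rng.gauss(0.0, noise_sd))
        labels.append(1 if clean >= cut else 0)
    return signal, labels


def windows(signal, half=1, scale=100.0):
    """One feature vector per scan: scaled intensities of it and its neighbours."""
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [[v / scale for v in padded[i:i + 2 * half + 1]]
            for i in range(len(signal))]


class TinyMLP:
    """One tanh hidden layer, one sigmoid output, trained by plain SGD."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * v for w, v in zip(self.w2, h)) + self.b2
        z = max(-60.0, min(60.0, z))  # guard against overflow in exp
        return h, 1.0 / (1.0 + math.exp(-z))

    def train(self, X, y, rng, epochs=40, lr=0.05):
        order = list(range(len(X)))
        for _ in range(epochs):
            rng.shuffle(order)
            for i in order:
                h, p = self.forward(X[i])
                err = p - y[i]  # cross-entropy gradient at the output
                for j in range(len(h)):
                    grad_h = err * self.w2[j] * (1.0 - h[j] ** 2)
                    self.w2[j] -= lr * err * h[j]
                    self.b1[j] -= lr * grad_h
                    for k in range(len(X[i])):
                        self.w1[j][k] -= lr * grad_h * X[i][k]
                self.b2 -= lr * err

    def predict(self, x):
        return 1 if self.forward(x)[1] >= 0.5 else 0


rng = random.Random(0)
# Training trace: taller peaks, standing in for untreated samples.
train_sig, train_lab = synthetic_trace(
    600, [60, 150, 240, 330, 420, 510],
    [500, 900, 650, 1000, 750, 550], rng)
# Test trace: lower peaks, standing in for degraded samples.
test_sig, test_lab = synthetic_trace(
    600, [80, 170, 260, 350, 440, 530],
    [300, 450, 350, 600, 400, 500], rng)

net = TinyMLP(n_in=3, n_hidden=4, rng=rng)
net.train(windows(train_sig), train_lab, rng)
preds = [net.predict(x) for x in windows(test_sig)]
accuracy = sum(p == t for p, t in zip(preds, test_lab)) / len(test_lab)
print(f"fraction of test scans classified correctly: {accuracy:.3f}")
```

Because baseline scans usually far outnumber peak scans, overall accuracy can be dominated by the baseline class, so it may be worth reporting the proportion of peak scans recovered alongside it.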
These results are a promising initial finding for the forensic and statistics communities, and further research in the coming years would be valuable. The original approach would also be worth exploring in future, provided that the raw data can be extracted from the electropherograms. Increasing the accuracy of DNA profiles used in the judicial system benefits the wider community and would be a significant advancement to an already well-recognised procedure.