A time series is a collection of random variables indexed by the order in which they are observed in time. The main objective of time series analysis is to develop statistical models that describe the data and then forecast the future behaviour of the system. An autoregressive (AR) model is an appropriate model for analysing stationary time series and, despite its simplicity, is widely used in applied fields ranging from genetics and medical sciences to finance and engineering.
However, in the age of “big data”, estimating the parameters of an autoregressive model involves solving many potentially large-scale ordinary least squares (OLS) problems, which can be the main computational bottleneck. Algorithms that draw on randomised numerical linear algebra (RandNLA) ease this burden by sampling a portion of the data matrix while still preserving many of its important properties. My project empirically compared two RandNLA-based algorithms on both synthetic and real-world big time series data. The two algorithms were LSAR and an algorithm with three variations, which we called DW1-DW2-DW3. In particular, I examined the differences in the time taken to compute the partial autocorrelation function (PACF), an important diagnostic tool for selecting the correct order of the AR model, and in the relative error of the parameter estimates.
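As a rough illustration of the idea described above, the following Python sketch fits an AR model by OLS on the full lagged data matrix and again on a random subset of its rows, then reports the relative error between the two estimates. It is only a sketch under stated assumptions: the rows are sampled uniformly, which is a simple stand-in and not the sampling scheme used by LSAR or DW1-DW2-DW3; the relative error is assumed to be the Euclidean norm of the difference divided by the norm of the full OLS estimate; and every function name, the AR(2) coefficients and the sample sizes are hypothetical. The project itself used MATLAB; Python with only the standard library is used here so the example is self-contained.

```python
import random


def simulate_ar(phi, n, rng, burn_in=200):
    """Simulate n values of y_t = phi[0]*y_{t-1} + ... + phi[p-1]*y_{t-p} + e_t, e_t ~ N(0, 1)."""
    p = len(phi)
    y = [0.0] * p
    for _ in range(n + burn_in):
        recent = y[-p:][::-1]  # y_{t-1}, ..., y_{t-p}
        y.append(sum(c * v for c, v in zip(phi, recent)) + rng.gauss(0.0, 1.0))
    return y[-n:]  # drop the initial zeros and the burn-in period


def lagged_rows(y, p):
    """Build the OLS problem: each row of X holds y_{t-1}, ..., y_{t-p}; b holds y_t."""
    X = [[y[t - k] for k in range(1, p + 1)] for t in range(p, len(y))]
    return X, y[p:]


def solve_ols(X, b):
    """Solve min ||X phi - b|| via the normal equations and Gaussian elimination."""
    p = len(X[0])
    # Augmented system [X^T X | X^T b].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * bi for row, bi in zip(X, b))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting, then eliminate entries below the pivot.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    phi = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        tail = sum(A[i][j] * phi[j] for j in range(i + 1, p))
        phi[i] = (A[i][p] - tail) / A[i][i]
    return phi


def relative_error(estimate, reference):
    """Assumed definition: ||estimate - reference|| / ||reference|| (Euclidean norm)."""
    num = sum((e - r) ** 2 for e, r in zip(estimate, reference)) ** 0.5
    den = sum(r ** 2 for r in reference) ** 0.5
    return num / den


rng = random.Random(0)
true_phi = [0.5, -0.3]                      # hypothetical stationary AR(2) model
y = simulate_ar(true_phi, 5000, rng)
X, b = lagged_rows(y, len(true_phi))

phi_full = solve_ols(X, b)                  # OLS on every row
idx = rng.sample(range(len(X)), 1000)       # uniform row sample (stand-in only)
phi_sub = solve_ols([X[i] for i in idx], [b[i] for i in idx])

print("full OLS estimate:   ", phi_full)
print("sampled OLS estimate:", phi_sub)
print("relative error:      ", relative_error(phi_sub, phi_full))
```

The sampled problem has one fifth of the rows, so it is cheaper to solve, at the cost of some error relative to the full OLS solution. That trade-off between speed and accuracy is what the two quantities compared in the project measure.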
To achieve this, I worked in MATLAB. This was a new experience for me, since I had only ever worked in R, but MATLAB was more appropriate given the large amount of matrix work involved in solving large-scale OLS problems. I soon saw the similarities and differences between MATLAB and R and was able to adjust comfortably. I was given some pre-written MATLAB code, which I then modified and added to where necessary.
The main goal of my summer research project was to empirically compare the performance of the LSAR algorithm against all three variations of the DW1-DW2-DW3 algorithm. To achieve this, I needed to run the comparison many times. There were three comparisons to run (LSAR against each of DW1, DW2 and DW3), three synthetic AR models (AR(4), AR(40) and AR(100)) from which to generate data sets, 10 data sets generated from each of the three AR models, and 100 replications of each of the three comparisons per data set, making a total of 9000 comparisons. Performing this purely through MATLAB would not have been feasible. I was fortunate to be introduced to the University of Newcastle’s Research Compute Grid (RCG), which I used to run these comparisons simultaneously. This also briefly exposed me to working in Linux, which I think was a valuable experience. An empirical comparison using real-world data was also run, replicating each comparison between LSAR and DW1, DW2 and DW3 1000 times for a total of 3000 comparisons.
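The totals quoted above can be checked by enumerating the experimental design. The short Python sketch below is illustrative only: the labels and the flat list of jobs are hypothetical and say nothing about how the comparisons were actually organised or scheduled on the RCG.

```python
from itertools import product

comparisons = ["LSAR vs DW1", "LSAR vs DW2", "LSAR vs DW3"]
models = ["AR(4)", "AR(40)", "AR(100)"]
datasets = range(1, 11)        # 10 synthetic data sets per model
replications = range(1, 101)   # 100 replications per comparison per data set

# One job per (comparison, model, data set, replication) combination.
synthetic_jobs = list(product(comparisons, models, datasets, replications))
print(len(synthetic_jobs))     # 3 * 3 * 10 * 100 = 9000

# Real-world data: each of the three comparisons replicated 1000 times.
real_jobs = list(product(comparisons, range(1, 1001)))
print(len(real_jobs))          # 3 * 1000 = 3000
```

Each job in such a list is independent of the others, which is why running them in parallel on a compute grid is a natural fit.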
When the results were averaged and smoothed over all the comparisons, they showed that the LSAR algorithm outperformed the DW1-DW2-DW3 algorithm, with faster PACF times and smaller relative errors in the parameter estimates on average.
Thomas McCarthy McCann
The University of Newcastle
