This post is a how-to guide to apply ARIMA models to time series. I’ll use it as a quick reference/reminder for myself and hope somebody out there find it useful too.
if you are looking for a thorough manual I recomend this online book
The data I am using for this post (Age of Death of Successive Kings of England) has been made available by Rob Hyndman in his Time Series Data Library at http://robjhyndman.com/TSDL/.
- Get your data into a ts object first
Once data is in the right format then we start our analysis.
1 Plot data and examine. Do a visual inspection to determine if your series is non-stationary.
2 Examine Autocorrelation Function (ACF) for stationarity. The ACF for a non-stationary series will show large autocorrelations that diminish only very slowly at large lags.
Note: Don’t even bother looking at the Partial Autocorrelation Function if the time series is non-stationary.
At first glance we could say it is non-stationary but let’s run a unit-root test.
In Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, the null hypothesis is that the data is stationary. In this case small p-values (less than 0.05) suggest that differencing is required.
With a KPSS p-value of 0.0144 we reject the null hypothesis of stationarity.
3 If not stationary, take first differences
Note: If variance appears non-constant, take logarithm before first differencing
Now both the ACF plot and the KPSS test (p-value = 0.1) suggest that the time series is stationary.
Before going into the Model Identification and Estimation it is advisable to check for white noise on your stationary series, because if your series is white noise there is nothing to model and there is no point in carrying out any estimation or forecasting.
For this we can carry out a Ljung-Box test. This can be done in R using the “Box.test()”, function.
Here the Ljung-Box test statistic is X-squared = 25.1,(p-values shown for the Ljung-Box statistic plot are incorrect see here) so we need to compare the observed X-aquared versus a X-squared with 20 d.f. and critical value at the 0.05.
Observed Chi-squared 25.1 < 31.41 so there is no evidence of non-zero autocorrelations (white noise) at lags 1-20.
Model Identification and Estimation
#### 1. Examine the ACF and PACF
Examine the Autocorrelation Function and Partial Autocorrelation Function of your stationary series to understand what ARIMA(p,d,q) models to estimate. The d in ARIMA stands for the number of times the data have been differenced to become to stationary. In this case d=1.
The p in ARIMA(p,d,q) measures the order of the autoregressive component. To get an idea of what orders to consider, examine the partial autocorrelation function (PACF). If the time-series has an autoregressive order of 1, called AR(1), then we should see only the first partial autocorrelation coefficient as significant. If it has an AR(2), then we should see only the first and second partial autocorrelation coefficients as significant. (Note, that they could be positive and/or negative; what matters is the statistical significance.) Generally, the partial autocorrelation function PACF will have significant correlations up to lag p, and will quickly drop to near zero values after lag p.
The above PACF plot suggests an AR(2) or AR(3) could be adecuated in this case.
The q measures the order of the moving average component. To get an idea of what orders to consider, we examine the ACF. If the time series is a moving average of order 1, called a MA(1), we should see only one significant autocorrelation coefficient at lag 1. This is because a MA(1) process has a memory of only one period. If the time series is a MA(2), we should see only two significant autocorrelation coefficients, at lag 1 and 2, because a MA(2) process has a memory of only two periods. Generally, for a time series that is a MA(q), the autocorrelation function will have significant correlations up to lag q, and will quickly drop to near zero values after lag q.
The ACF suggests a MA(1) as only lag 1 exceeds the significance bounds.
We have identified p,d,q so our potential models can be:
- ARIMA(3,1,0) since the PACF is zero after lag 3
- ARIMA(0,1,1) since the ACF is zero after lag 1
- ARIMA(3,1,1) a combined model of previous models.
From the point of view of parsimony we should pick the model with the least number of paramater in this case the ARIMA(0,1,1) but personally I’d like to try all models and select the model with the lower AIC (AIC penalizes the number of paramaters in the model).
In this case ARIMA(0,1,1) seems a good candidate for our Age of Death of Successive Kings of England time series.
Note: This part of the post is literally a copy & paste from Avril Coghlan’s online book (with minor changes) I mentioned at the begining of this post.
Let’s use our ARIMA(0,1,1) to make forecasts for future values using the forecast.Arima() function in the “forecast” R package.
The ARIMA model gives the forecasted age at death of the next five kings as 67.8 years.
It is a good idea to investigate whether the forecast errors of an ARIMA model are normally distributed with mean zero and constant variance, and whether the are correlations between successive forecast errors.
Since the correlogram shows that none of the sample autocorrelations for lags 1-20 exceed the significance bounds, and the X-squared for the Ljung-Box test is 13.58, we can conclude that there is very little evidence for non-zero autocorrelations in the forecast errors at lags 1-20.
To investigate whether the forecast errors are normally distributed with mean zero and constant variance, we can make a time plot and histogram (with overlaid normal curve) of the forecast errors: