(A Validation of Dr. Keenan's Model)
The latest contestants to enter the climate debate stage are
mathematician Douglas J. Keenan and the UK's Meteorological Office, or Met for
short. On November 8, 2012, Lord Donoughue, a member of the UK Parliament's
House of Lords, posed the question:
To ask Her Majesty's Government ... whether they consider a rise in global temperature of 0.8 degrees Celsius since 1880 to be significant. [HL3050].
The Met Office answered, "the temperature rise since about
1880 is statistically significant," implying that the temperature rise since 1880 is attributable to human-caused global warming.
Keenan challenged the Met Office's reply to Parliament on
the basis that the model chosen by the Met's chief scientist, Dr. Julia Slingo,
was not suited to answer the question. Keenan offered up his own choice of
model that he claims is much better suited. As you might imagine, a series
of back-and-forth opinion editorials and commentaries ensued, sparking a rather
heated debate.

Douglas Keenan – Editor of InforMath.org
Much has already been written about the tone of the debate,
the Met's failure to provide specifics on their claims of statistical significance,
and Keenan's brazen indictment of the Met's Chief Executive Officer, John Hirst,
and its leading scientist, Dr. Slingo. It is the ruckus between Keenan and the Met
that has captured most people's interest and is the basis of most articles
on the matter.
Being a numbers guy, I'm less interested in the tone of the
debate and more interested in the models that the Met and Keenan espouse. I'll
lay out the strengths and weaknesses of the two models and offer a closing
assessment of which model is more suited to determining the statistical
significance of the temperature rise of 0.8 degrees C since 1880.
Statistical significance is a somewhat nebulous term when it
comes to non-stationary time series data such as the temperature record. Most measurements of processes in nature
will show statistically significant results if the time series is long enough
simply because Mother Nature does many cyclical or curvilinear things in big
ways. The following graphic illustrates this.

The Pacific Decadal Oscillation is a natural cycle with significant trends
In medical research, statistical significance is usually
determined by comparing a control group to an experimental group. However, in
climate research there is no control group, making the testing for statistical
significance a more mathematically tangled affair, which sparks much
disagreement among scientists on how to best do it.
The central debate between Keenan and the Met Office
revolves around which model provides a better "goodness of fit" (GOF) to the
temperature record and has the best predictive skills from which to infer
statistical significance. Rather than focus on the question of statistical
significance in post-industrial temperatures, which I'll leave to Met and
Keenan to debate, this article takes an objective look at the two models they
espouse and why one model is better suited to determine statistical
significance.
The Met uses the AR(1) model. They claim their model shows
statistical significance in current temperature trends, inferring that recent
temperature trends are being caused by human activities. Keenan's charge is
that AR(1) is not robust enough to make such a determination. He espouses the
ARIMA(3,1,0) model, which he claims is more robust. He claims his model shows
today's temperature trends are not statistically significant and are more likely
caused by natural variability.
Decomposing The Models
Both models begin with the letters AR, meaning that both
models employ Auto-Regression, a stochastic process in which each new value in
the series is produced from a weighted sum of previous measurements plus a
white noise error term. In the case of the Met's AR(1) model, the one (1) means
it is a "first-order autoregression" model. In practical terms, first order
means the temperature from last year has a direct impact on this year's
temperature, but the temperature from two years ago or earlier has no direct
relationship to this year's.
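To make that concrete, here is a minimal sketch of the AR(1) recursion in R, with a coefficient and noise level I made up purely for illustration:
set.seed(1)
n    <- 132                    # 1880 through 2011, for illustration
phi  <- 0.6                    # hypothetical first-order coefficient
e    <- rnorm(n, sd = 0.1)     # white noise error term
temp <- numeric(n)
for (t in 2:n) temp[t] <- phi * temp[t - 1] + e[t]   # each year depends only on the prior year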
The use of the AR(1) model is premised on the assumption
that changes in temperatures have only first order relationships as described
above. That's a rather bold and unjustified assumption in my opinion. There are
known natural cycles and events, lasting from a few years to many decades,
which affect the earth's temperature. The notion that temperatures more than
one year back can have some effect on today's temperature seems intuitive to
me. This is one of Keenan's criticisms of the AR(1) model – a criticism
that I agree with.
The ARIMA(3,1,0) model proposed by Keenan is parameterized
to use third-order autoregression, as indicated by the value three (3). Keenan's
model assumes the three previous temperature measurements and the white noise
term contribute to the current temperature measurement. If the sum of the three
autoregressive coefficients is positive, the output will resemble a low
pass filter, with the high frequency part of the white noise decreased in
amplitude. We'll see that low pass filtering effect when we start looking at
comparative graphs of AR(1) vs. ARIMA(3,1,0) model output.
There are three parts to the model: "AR" (AutoRegression), which has been
discussed above. The "I" (Integrated) term indicates that the model differences
the data to change it from a non-stationary to a stationary structure, making
it more suitable for autoregression; Keenan's model applies one (1) order of
differencing. The "MA" (Moving Average) term indicates the model applies a
moving average to its output. Because Keenan's value is zero (0), no moving
average is applied to the model's output.
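As a rough sketch (again with made-up coefficients), the driftless ARIMA(3,1,0) recursion works on the year-to-year differences rather than on the temperatures themselves:
set.seed(1)
n     <- 132
phi   <- c(0.4, 0.2, 0.1)      # three hypothetical autoregressive coefficients
e     <- rnorm(n, sd = 0.1)    # white noise error term
diffs <- numeric(n)
for (t in 4:n) diffs[t] <- sum(phi * diffs[(t - 1):(t - 3)]) + e[t]
temp  <- cumsum(diffs)         # the "I" step: integrate the differences back to temperatures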
Keenan's model is flexible and more adaptive but considerably
more complex. That complexity is the Met's chief criticism of Keenan's choice of
model. I'll be exploring that criticism shortly.

The Battle of the Met's AR(1) and Keenan's ARIMA(3,1,0) Models
To determine "the better model," there are a number of
objective measures we can look at to assess the models' parameterization,
goodness of fit to observed temperatures, and predictive skill. In Keenan's
supplementary material, he explains that he tested the Met's AR(1) and his
preferred ARIMA(3,1,0) models against the GISS temperature series. In the Met's
response to Parliament regarding Keenan's criticisms, they used the HadCRUT4
temperature series.
I personally don't think it matters which temperature series is used to
test the models. If it does, then we have other systematic issues to
deal with that go beyond the choice of models. For the purpose of this article,
I'll use the latest GISS GHCNv3 temperature series to compare the two models.
All of my statistical analysis is done using the statistical package R with the
"forecast" library – a package designed for AR and ARIMA modeling.
Let the Analysis Begin!
Test for Statistical Significance
– I performed appropriate tests to determine if the results I was
getting were, themselves, statistically significant and could be relied upon. Again,
I'm not testing the statistical significance of the temperature series; rather,
I'm testing the reliability of the tests I'm performing on the models. This is
standard practice in mathematics. In most cases an ANOVA test was performed. In
all cases, I tested at the 95% confidence level (alpha = 0.05; results
significant when p < 0.05). The tests I performed on both models were as follows:
Sigma^2 - When we
compare the fit of the models to the data, we look for the smaller variance
of the residual (error) of the model. The residual represents the portion of
the data the model can't explain. By "can't explain," I mean the model fails to
account for it. The lower the residual variance, the better the model fits the
data and explains it. The variance of the residuals is called the Sigma^2.

Akaike Information Criterion test with correction (AICc) – AICc is a method of
selecting the best model from a set of candidate models. The best model is the
one that minimizes the Kullback-Leibler distance between the model and the data.
Kullback-Leibler distance measures the difference between the probability
distribution of the model and the distribution of the actual temperature
series. The model with the best GOF (or least divergence between the two
distributions) is the better model.

Bayesian Information Criterion (BIC) – The chief criticism the Met levies
against Keenan's choice of model is that the greater number of parameters in
his chosen model makes it unnecessarily complicated and prone to error. The
BIC test is similar to the AICc test but penalizes the model for
over-parameterization (being unnecessarily complicated for the data under
analysis). It is a better test when comparing models with differing orders of
parameterization, as in our testing.

Coefficient of Determination (R2) and Residual (1-R2) - The R2
measure tells us what percentage of the variance in the observed trend is
explained by the model. It is a measure of how well the model is designed and
parameterized to account for the variance and trend in the data. The residual
of the coefficient, 1-R2, tells us what percentage of the variance is due to
residual (error) and is not accounted for by the model. The higher the R2 and
the lower the 1-R2, the better the model is designed and parameterized.

F Ratio - The F ratio
is part of the ANOVA test of the model's performance. It tells us the ratio of
the variability of the trend explained by the model to the unexplained
variability of the trend, each divided by its respective degrees of freedom
(df). The higher the F ratio, the more useful the model is for analyzing the
temperature data.

Lag Correlations
– The lag correlations tell us how well each model's parameterization
captures the autocorrelation structure of the time series data. There are two
objective measures of lag correlation: ACF (residual AutoCorrelation Function)
and PACF (residual Partial AutoCorrelation Function). Collectively, these two
measures determine how well the parameters used for the "AR", "I", and "MA"
functions perform for the respective models. They can also be used to search
for the best parameters. ACF and PACF are best analyzed visually in graph form,
as in the sketch and graphs below.
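As a sketch of how these diagnostics fall out of the fitted objects from earlier (same assumed names):
fit.arima310$sigma2             # residual variance (Sigma^2)
fit.arima310$aicc               # corrected Akaike Information Criterion
fit.arima310$bic                # Bayesian Information Criterion
Acf(residuals(fit.arima310))    # residual autocorrelation plot
Pacf(residuals(fit.arima310))   # residual partial autocorrelation plot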

I could write a book about how to interpret ACF and PACF.
So, to keep it simple, we'll look at one important aspect of the above graphs.
Met's AR(1) model is on the left side with ACF at the top and PACF at the
bottom. Keenan's ARIMA(3,1,0) model is on the right. Notice that each graph has
dotted blue lines. These are referred to as the cutoff bands. When a model's
parameterization is optimum, its lag values will remain inside the cutoff. A
special note is in order here. In R, a Lag 0 line is drawn for the ACF graph
that extends well outside of the cutoff lines. The lag-0 autocorrelation is
always exactly 1 by definition, so that line can be ignored.
Note that Met's AR(1) model makes excursions outside of the
cutoff lines. Keenan's ARIMA(3,1,0) model's lag values remain inside the cutoff
lines, indicating that Keenan's model autocorrelates
better to the time series data.
Diebold-Mariano Test
– This tests the null hypothesis that the two models perform equally
well against the same time series data. The test compares the residuals (error)
of each model. The results of the Diebold-Mariano test are as follows:
DM = -2.4648
Forecast horizon = 1
Loss function power = 2
p-value = 0.01499
alternative hypothesis: two.sided
The Diebold-Mariano test is statistically significant (p = 0.015), meaning we reject the null
hypothesis that both models perform equally. The DM value is negative,
indicating that Keenan's ARIMA(3,1,0)
model performed better than Met's AR(1).
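A sketch of how such a result can be produced with the forecast package's dm.test, using the assumed fit names from earlier; a negative statistic favors the first set of residuals passed in:
dm.test(residuals(fit.arima310), residuals(fit.ar1),
        alternative = "two.sided", h = 1, power = 2)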
The Box-Ljung Test
for Residual Dependency – The Box-Ljung test is a diagnostic tool
used to test the null hypothesis that the model does not exhibit a lack of fit
to the time series. The test is applied to the residuals of a time series after
fitting an ARIMA(p, d, q) model to the data. The test examines autocorrelations
of the residuals for dependency. If the autocorrelations are very small, we
conclude that the model does not exhibit significant lack of fit.
This is somewhat inverse logic and a little bit difficult to
follow. Ideally, you want the test to fail to reject the null hypothesis (a
p-value > 0.05), supporting the conclusion that the observed fit in the model
is due to the model's predictive skill rather than leftover structure in the
residuals (I'm oversimplifying here, but hopefully you mathy people get the point).
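A sketch of the check in R, where the lag choice is an illustrative assumption and fitdf counts the three autoregressive parameters:
Box.test(residuals(fit.arima310), lag = 10,
         type = "Ljung-Box", fitdf = 3)   # want p > 0.05: no leftover dependency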

The Proof Is In the Tasting of the Pudding
So far, I've thrown out a bunch of model tests and made my
best attempt to explain what they mean. If you've reached this place in the
article, congratulations, you managed to stay awake, LOL! For those who found
the above boring, stick around. It's going to get more fun from here.
The proof is in the tasting of the pudding. Both the Met's AR(1) and Keenan's
ARIMA(3,1,0) models are time series predictive models, meaning they are
mathematical crystal balls that can tell the future, given some hint of the
past.
This is what the models render when we plug all of the
temperature data into them:

The actual GISS GHCNv3 temperature trend is the black line
in the graphic. Met's AR(1) model output is in red. Keenan's ARIMA(3,1,0) model
output is in green. If you study this graph closely, you might notice something
odd toward the recent end of the timeline. Here, let me show you:

One of the criticisms of the AR(1) model is that its ability to
follow a non-stationary trend falls off as the time series gets longer. Above,
you can see that up until 1960, the AR(1) model tracks the actual temperature
trend fairly well. However, as we approach the year 1990 and beyond, it starts
to drift negative in relation to the actual trend. It is a "drifting" model,
whereas Keenan's ARIMA(3,1,0) is driftless.
Earlier in this article, I mentioned that the ARIMA(3,1,0)
model uses third-order autoregression. In mostly positive or mostly negative
trends it tends to act somewhat as a low pass filter, meaning it favors the
lower frequencies of the trend and tends to be less reactive to the high
frequency variability components of the trend. You can see that the green line
of the ARIMA(3,1,0) model is less variable than the AR(1) line. The key
performance indicator of these models is how well they predict the trend, not
so much how well they account for high frequency components in the trend.
Now, we're going to see how well the two models serve as
crystal balls. In this next test, I removed the years 2002 through 2012 from
the GISS GHCNv3 temperature series. I intentionally selected 2002 as my
truncation date because the years from 1997 through the end of 2001 provide a
"hint" that the rate of global warming was beginning to stagnate. I wanted both
models to have equal opportunity to get the hint.
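A sketch of the holdout test, again assuming the giss series from earlier:
train <- window(giss, end = 2001)                  # hold out 2002 through 2012
fc.arima310 <- forecast(Arima(train, order = c(3, 1, 0)),
                        h = 11, level = c(90, 95)) # 11-year-ahead prediction
fc.ar1 <- forecast(Arima(train, order = c(1, 0, 0), include.drift = TRUE),
                   h = 11, level = c(90, 95))
plot(fc.arima310)
lines(giss)                                        # overlay the actual anomalies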
The following two graphics overlay the models' predictions
on the actual temperature trend, which includes the 2002 through 2012 actual
temperature anomalies.
First up, Keenan's ARIMA(3,1,0) model prediction for the
years 2002 through 2012:

And now, Met's AR(1) model's prediction of temperatures from
2002 to 2012:

In both graphics, the dark blue lines are the model's
temperature prediction. The light blue bands around the prediction lines are
the 90% confidence interval. The still wider gray bands are the 95% confidence
interval.
It is obvious that Keenan's ARIMA(3,1,0) model took the hint
I provided and modeled the temperatures from 2002 to 2012 with an accuracy of
close to 0.05 deg. C. Met's AR(1) model completely missed the hint and assumed
that temperatures would follow the momentary down trend that was occurring at
the truncation of the temperature series.
An interesting surprise to me was that Keenan's ARIMA(3,1,0)
model actually predicted the two peaks and one trough in temperature trends
during the prediction period, albeit somewhat muted as expected. You can see
the two small peaks and single trough in the dark blue line in the ARIMA(3,1,0)
model prediction. Well done!
Is There a Better Model?
A model can always be improved upon. We hear that from
climatologists all the time, right? Well, actually it's true. Referring back to
my discussion of ACF and PACF, I played around with the ARIMA model class and
found a set of parameters that performed slightly better than Keenan's
ARIMA(3,1,0) model. My model was the ARIMA(0,1,1).
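A sketch of that alternative fit; the forecast package's auto.arima is one way to search for such candidates automatically:
fit.arima011 <- Arima(giss, order = c(0, 1, 1))   # first difference plus a moving average term
summary(fit.arima011)
# auto.arima(giss)                                # automated search over (p, d, q)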
When I say slightly better, I mean my ARIMA(0,1,1) model's improvements
failed several tests for statistical significance against Keenan's ARIMA(3,1,0)
model, meaning that while some of my key performance metrics were ever so
slightly better, they weren't better enough to be of any interest outside of
the fun I had playing with it. Here's a graph of my model compared to Met's and
Keenan's.

The black line is the actual temperature trend. Red is Met's
AR(1) model. Purple is Keenan's ARIMA(3,1,0) model, and green is my
ARIMA(0,1,1) model. As you can see, my model tracked Keenan's almost
identically and was not significantly different in performance. I ran the 2002
to 2012 predictive test on my model and, while I got a linear trend similar to
Keenan's, mine failed to predict the peaks and troughs because it lacked
autoregression to learn patterns from the past.
Which Model Won?
Forgetting about my uninteresting model, we're going to come full
circle to the central debate – the question of which model is better
suited for determining statistical significance in answer to Parliament's
question.
Rather than putting my dog in the fight, I'll let the Met answer
the question. The Met opens their response to Her Majesty's Government with the
following:
This briefing paper has been produced to provide background information relating to analyses undertaken in response to a series of Parliamentary Questions on the use of statistical models to assess the global temperature record, and to address misleading ideas currently appearing in various online discussions.
The closing phrase about "misleading ideas currently appearing in various
online discussions" obviously refers to Keenan (he who shall not be named). But
in a breathtakingly contradictory statement, the Met concludes that Keenan's
critique of their choice of the AR(1) model holds merit:
The results show that the linear trend model with first-order autoregressive noise [Met's model] is less likely to emulate the global surface temperature timeseries than the driftless third-order autoregressive integrated model [Keenan's model]. ... This provides some evidence against the use of a linear trend model with first-order autoregressive noise [Met's model] for the purpose of emulating the statistical properties of instrumental records of global average temperatures, as would be expected from physical understanding of the climate system.
Oops! Talk about an awkward moment. Then, in a face-saving
move, the Met closes with this amazing statement:
These results have no bearing on our understanding of the climate system or of its response to human influences such as greenhouse gas emissions, and so the Met Office does not base its assessment of climate change over the instrumental record on the use of these statistical models.
Translation: "We don't
need no stinking statistical models to determine statistical significance. We
can determine it by eyeballing the data and you'll just have to take our word
for it."
What the Met has done is state to Parliament that the rise
in global temperature of 0.8 degrees Celsius since 1880 is statistically
significant, while conceding that Keenan is right in his criticism that the
model they used is useless for making that determination. But no worries, they
don't need statistical models to determine statistical significance.
Do you have any idea how incredibly wrong that sounds to a
statistician? Statistical significance can only be determined by using
statistical tests. That's why the term "statistically" is in front of
"significant." That was the whole point of Parliament's question.
[Edit June 14, 2013]: My work was brought to the attention of Lord Bernard Donoughue, UK Labour peer, who spearheaded the parliamentary inquiry into
the Met Office's answer. He offered a kind response to my article.
This article is an updated reprint of my original work of the same title published at Suyts Space, an online science and politics venue.