The Battle of the Models - Keenan vs. Met

Jun 09, 2013


(A  Validation of Dr. Keenan's Model)

The latest contestants to enter the climate debate stage are mathematician Douglas J. Keenan and the UK's Meteorological Office or Met for short. On November 8, 2012, Lord Donoughue, a member of the UK Parliament posed the question:

To ask Her Majesty's Government ... whether they consider a rise in global temperature of 0.8 degrees Celsius since 1880 to be significant. [HL3050].

The Met Office answered, "the temperature rise since about 1880 is statistically significant," meaning that the temperature rise since 1880 is attributable to human caused global warming.

Keenan challenged the Met Office's reply to Parliament on the basis that the model chosen by Met's leading statistician, Dr. Julia Slingo, was not suited to answer the question. Keenan offered up his own choice of model that he claims was much better suited. As you might imagine, a series back and forth opinion editorials and commentaries ensued which sparked a rather heated debate.


Douglas Keenan – Editor of


Much has already been written about the tone of the debate, Met's failure to provide specifics on their claims of statistical significance, and of Keenan's brazen indictment of Met's Chief Executive Officer, John Hirst and leading scientist Dr. Slingo. It is the ruckus between Keenan and the Met that has captured most people's interest and is the basis of most articles on the matter.

Being a numbers guy, I'm less interested in the tone of the debate and more interested in the models that the Met and Keenan espouse. I'll lay out the strengths and weaknesses of the two models and offer a closing assessment of which model is more suited to determining the statistical significance of the temperature rise of 0.8 degrees C since 1880.

Statistical significance is a somewhat nebulous term when it comes to non-stationary time series data such as the temperature record. Most measurements of processes in nature will show statistically significant results if the time series is long enough simply because Mother Nature does many cyclical or curvilinear things in big ways. The following graphic illustrates this.


The Pacific Decadal Oscillation is a natural cycle with significant trends


In medical research, statistical significance is usually determined by comparing a control group to an experimental group. However, in climate research there is no control group, making the testing for statistical significance a more mathematically tangled affair, which sparks much disagreement among scientists on how to best do it.

The central debate between Keenan and the Met Office revolves around which model provides a better "goodness of fit" (GOF) to the temperature record and has the best predictive skills from which to infer statistical significance. Rather than focus on the question of statistical significance in post-industrial temperatures, which I'll leave to Met and Keenan to debate, this article takes an objective look at the two models they espouse and why one model is better suited to determine statistical significance.

The Met uses the AR(1) model. They claim their model shows statistical significance in current temperature trends, inferring recent temperature trends are being caused by human activities. Keenan's charge is that AR(1) is not robust enough to make such determination. He espouses the ARIMA(3,1,0) model which he claims is more robust. He claims his model shows today's temperature trends are not statistically significant and more likely caused by natural variability.


Decomposing The Models


Both models begin with the letters AR, meaning that both models employ Auto-Regression, a stochastic process that weights the sum of previous measurements in the time series and white noise error to produce a predictive output for the next measurement. In the case of Met's AR(1) model, the one (1) means it is a "first-order autoregression" model. In practical terms, first order means the temperature from January of last year has a direct impact on January's temperature for this year but the temperature from January two years ago or earlier has no direct relationship to this year.

The use of the AR(1) model is premised on the assumption that changes in temperatures have only first order relationships as described above. That's a rather bold and unjustified assumption in my opinion. There are known natural cycles and events, lasting from a few years to many decades, which affect the earth's temperature. The notion that temperatures more than one year back can have some effect on today's temperature seems intuitive to me. This is one of Keenan's criticisms of the AR(1) model – a criticism that I agree with.

The ARIMA(3,1,0) model proposed by Keenan is parameterized to use third-order autoregression as indicated by the value three (3). Keenan's model assumes the three previous temperature measurements and the white noise term contribute to the current temperature measurement. If the sum of the three previous temperature measurements is positive, the output will resemble a low pass filter with the high frequency part of the white noise decreased in amplitude. We'll see that low pass filtering effect when we start looking at comparative graphs of AR(1) vs. ARIMA(3,1,0) model output.

There are three parts to the model: "AR" (AutoRegression), which has been discussed above. The "I" (Integrative) term indicates that the model applies an integrative function to change the data from non-stationary to stationary structure to make it more suitable for autoregression. Keenan's model applies one (1) order for differencing. The "MA" (Moving Average) term indicates the model applies a moving average to its output. Because Keenan's value is zero (0), no moving average is applied to the model's output.

Keenan's model is flexible and more adaptive but considerably more complex. This is the chief criticism of Met regarding Keenan's choice of model. I'll be exploring that criticism shortly.


The Battle of the Met's AR(1) and Keenan's ARIMA(3,1,0) Models


To determine "the better model," there are a number of objective measures we can look at to assess the model's parameterization, goodness of fit to observed temperatures, and predictive skills. In Keenan's supplementary material, he explains that he tested the Met's AR(1) and his preferred ARIMA(3,1,0) models against the GISS temperature series. In Met's response to Parliament of Keenan's criticisms, they used the HadCRUT4 temperature series.

I personally don't think which temperature series is used to test the models matters. If it does, then we have other systematic issues to deal with that go beyond the choice of models. For the purpose of this article, I'll use the latest GISS GHCNv3 temperature series to compare the two models. All of my statistical analysis is done using the statistical package R with the "Forecast" library – a package designed for AR and ARIMA modeling.


Let the Analysis Begin!


Test for Statistical Significance – I performed appropriate tests to determine if the results I was getting were, themselves, statistically significant and could be relied upon. Again, I'm not testing the statistical significance of the temperature series but rather I'm testing the reliability of the tests I'm performing on the models. This is standard practice in mathematics. In most cases an ANOVA test was performed. In all cases, I tested at the 95% Confidence Interval (alpha = 0.05, p < 0.05). The tests I performed on both models were as follows:

Sigma^2 - When we compare the fit of the models to the data, we look for the smaller standard deviation of the residual (error) of the model. The residual represents the portion of the data the model can't explain. By "can't explain," I mean the model fails to account for. The lower the residual, the better the model fits to the data and explains it. The standard deviation of the residual is called the Sigma^2.


Akaike Information Criterion test with correction (AICc) – AICc is a method of testing a given model from a set of models. The best model is one that minimizes the Kullback-Leibler distance between the model and the data. Kullback-Leibler distance measures the difference between the probability distributions of the model as compared to the actual temperature series distribution. The model that has the best GOF (or least divergence between the two distributions) is the better model.


Bayesian Information Criterion (BIC) – The chief criticism Met levies against Keenan's choice of model is the greater number of parameters used in his chosen model makes it unnecessarily complicated and prone to error.  The BIC test is similar to the AICc test but penalizes the model for over-parameterization (being unnecessarily complicated for the data under analysis). It is a better test when comparing models with differing orders of parameterization as in our testing.


Coefficient of Determination (R2) and Residual (1-R2) - The R2 measure determines what percentage of the variance in the trend produced by the model output can be explained by the model. It is a measure of how well the model is designed and parameterized to account for the variance and trend in the data. The residual of the coefficient, 1-R2, tells us what percentage of the variance in the model trend is due to residual (error) and is not accounted for by the model. The higher the R2 and the lower the 1-R2, the better the model is designed and parameterized.


F Ratio - The F ratio is part of the ANOVA test of the model's performance. It tells us the ratio of the variability of the trend explained by the model to the unexplained variability of the trend, each divided by the degrees of freedom (df) of autoregression used by the model. The higher the F ratio, the more useful the model is to analyze the temperature data.


Lag Correlations – The lag correlations tell us how well the parameterizations of the models perform their autocorrelation to the time series data. There are two objective measures of lag correlation: ACF (residual AutoCorrelation Function) and PACF (residual Partial AutoCorrelation Function).  Collectively, these two measures determine how well the parameters to be used for the "AR", "I", and "MA" functions perform for the respective models. They can also be used to test for the best parameters to be used. ACF and PACF are best analyzed visually in graph form.

I could write a book about how to interpret ACF and PACF. So, to keep it simple we'll look at one important aspect of the above graphs. Met's AR(1) model is on the left side with ACF at the top and PACF at the bottom. Keenan's ARIMA(3,1,0) model is on the right. Notice that each graph has dotted blue lines. This is referred to as the cutoff band. When a model's parameterization is optimum, its lag values will remain inside the cutoff. A special note is in order here. In R, a Lag 0 line is drawn for the ACF graph that extends well outside of the cutoff lines. This is a well-known defect in the ACF graphics function so it can be ignored.

Note that Met's AR(1) model makes excursions outside of the cutoff lines. Keenan's ARIMA(3,1,0) model's lag values remain inside the cutoff lines, indicating that Keenan's model autocorrelates better to the time series data.


Diebold-Mariano Test – This tests the null hypothesis that the two models perform equally as well against the same time series data.  The test compares the residuals (error) of each model. The results of the Diebold-Marino test are as follows:

DM = -2.4648

Forecast horizon = 1

Loss function power = 2

p-value = 0.01499

alternative hypothesis: two.sided

The Diebold-Marino test is statistically significant (p = 0.015), meaning we reject the null hypothesis that both models perform equally. The DM value is negative, indicating that Keenan's ARIMA(3,1,0) model performed better than Met's AR(1).


The Box-Ljung Test for Residual Dependency – The Box-Ljung test is a diagnostic tool used to test the null hypothesis that the model does not exhibit a lack of fit to the time series. The test is applied to the residuals of a time series after fitting an ARIMA(p, d, q) model to the data. The test examines autocorrelations of the residuals for dependency. If the autocorrelations are very small, we conclude that the model does not exhibit significant lack of fit.

This is somewhat inverse logic and a little bit difficult to follow. Ideally, you want the model to fail the Box-Ljung test (have a p-value > 0.05) supporting the null hypothesis that the observed fit in the model is due to the model's predictive skills (I'm oversimplifying here but hopefully you mathy people get the point).


The Proof Is In the Tasting of the Pudding


So far, I've thrown out a bunch of model tests and made my best attempt to explain what they mean. If you've reached this place in the article, congratulations, you managed to stay awake, LOL! For those who found the above boring, stick around. It's going to get more fun from here.

The proof is in the tasting of the pudding.  Both the Met's AR(1) and Keenan's ARIMA(3,1,0) models are time series predictive models, meaning they are mathematical crystal balls that can tell the future, given some hint of the past.

This is what the models render when we plug all of the temperature data into them:



The actual GISS GHCNv3 temperature trend is the black line in the graphic. Met's AR(1) model output is in red. Keenan's ARIMA(3,1,0) model output is in green. If you study this graph closely, you might notice something odd towards the current timeline in the graph. Here, let me show you:



One of the criticisms of the AR(1) model is it's ability to follow a non-stationary trend falls off as the time series gets longer. Above, you can see that up until 1960, the AR(1) model tracks the actual temperature trend fairly well. However, as we approach the year 1990 and forward, it starts to drift negative in relation to the actual trend. It is a "drifting" model whereas Keenan's ARIMA(3,1,0) is driftless.

Earlier in this article, I mentioned that the ARIMA(3,1,0) model uses third order auto-correlation. In mostly positive or mostly negative trends it tends to act somewhat as a low pass filter, meaning it favors the lower frequencies of the trend and tends to be less reactive to the high frequency variability components of the trend. You can see that the green line of the ARIMA(3,1,0) model is less variable than the AR(1) line. The key performance indicator of these models is how well they predict the trend and not so much how well they account for high frequency components in the trend.

Now, we're going to see how well the two models serve as crystal balls. In this next test, I removed the years 2002 through 2012 from the GISS GHCNv3 temperature series. I intentionally selected 2002 as my truncation date because the years from 1997 through the end of 2001 provide a "hint" that the rate of global warming was beginning to stagnate. I wanted both models to have equal opportunity to get the hint.

The following two graphics overlay the model's predictions over the actual temperature trend which includes the 2002 through 2012 actual temperature anomalies.


First up, Keenan's ARIMA(3,1,0) model prediction for the years 2002 through 2012:



And now, Met's AR(1) model's prediction of temperatures from 2002 to 2012:



In both graphics, the dark blue lines are the model's temperature prediction. The light blue bands around the prediction lines are the 95% Confidence Interval. The still wider gray bands are the 90% confidence Interval.

It is obvious that Keenan's ARIMA(3,1,0) model took the hint I provided and modeled the temperatures from 2002 to 2012 with an accuracy of close to 0.05 deg. C. Met's AR(1) model completely missed the hint and assumed that temperatures would follow the momentary down trend that was occurring at the truncation of the temperature series.

An interesting surprise to me was that Keenan's ARIMA(3,1,0) model actually predicted the two peaks and one trough in temperature trends during the prediction period, albeit somewhat muted as expected. You can see the two small peaks and single trough in the dark blue line in the ARIMA(3,1,0) model prediction. Well done!


Is There a Better Model?


A model can always be improved upon. We hear that from climatologists all the time, right? Well, actually it's true. Referring back to my discussion of ACF and PACF, I played around with the ARIMA model class and found a set of parameters that performed slightly better than Keenan's ARIMA(3,1,0) model. My model was the ARMIA(0,1,1).

When I say slightly better, I mean my ARIMA(0,1,1) model failed several tests for significance in performance against Keenan's ARIMA(3,1,0) model, meaning that while some of my key performance metrics were ever so slightly better, they weren't better enough to be of any interest outside of the fun I had playing with it. Here's a graph of my model compared to Met's and Keenan's.



The black line is the actual temperature trend. Red is Met's AR(1) model. Purple is Keenan's ARIMA(3,1,0) model, and green is my ARIMA(0,1,1) model. As you can see, my model tracked Keenan's almost identically and was not significantly different in performance. I ran the 2001 to 2012 predictive test on my model and, while I got a linear trend similar to Keenan's, mine failed to predict the peaks and troughs because it lacked autoregression to learn patterns from the past.


Which Model Won?


Forgetting about my uninteresting model we're going to come full circle in the central debate – the question of which model is the better for determining the statistical significance in answer to Parliament's question?

Rather than putting my dog in the fight, I'll let Met answer the question. The Met opens their response to Her Majesty's Government with the following:

This briefing paper has been produced to provide background information relating to analyses undertaken in response to a series of Parliamentary Questions on the use of statistical models to assess the global temperature record, and to address misleading ideas currently appearing in various online discussions.

The emboldened text above obviously refers to Keenan (he who shall not be named). But in a breathtakingly contradictory statement, Met concludes Keenan's critique of their choice of the AR(1) model holds merit:

The results show that the linear trend model with first-order autoregressive noise [Met's model] is less likely to emulate the global surface temperature timeseries than the driftless third-order autoregressive integrated model [Keenan's model]. ... This provides some evidence against the use of a linear trend model with first-order autoregressive noise [Met's model] for the purpose of emulating the statistical properties of instrumental records of global average temperatures, as would be expected from physical understanding of the climate system.

Oops! Talk about an awkward moment. Then in a face saving move, the Met closes with this amazing statement:

These results have no bearing on our understanding of the climate system or of its response to human influences such as greenhouse gas emissions and so the Met Office does not base its assessment of climate change over the instrumental record on the use of these statistical models.

Translation: "We don't need no stinking statistical models to determine statistical significance. We can determine it by eyeballing the data and you'll just have to take our word at it."

What Met has done was to state to Parliament that the rise in global temperature of 0.8 degrees Celsius since 1880 is statistically significant but Keenan is right in his criticism that the model they use is useless in making that determination. But no worries, they don't need statistical models to determine statistical significance.

Do you have any idea how incredibly wrong that sounds to a statistician? Statistical significance can only be determined by using statistical tests. That's why the term "statistically" is in front of "significant." That was the whole point of Parliament's question.

[Edit June 14, 2013]: My work was brought to the attention of Lord Bernard Donoughue, UK Parliament Labor Peer, who spearheaded the investigation into the MET Office inquiry. He offered a kind response to my article.

Lord Donoughue response

This article is an updated reprint of my original work of the same title published at Suyts Space, an on-line science and politics venue.

blog comments powered by Disqus