Calibration and test sets

When you’re doing any statistical analysis, the standard thing to do is to divide your data into “calibration” and “test” data sets. You build the model on the “calibration” data set, and then test it on the “test” data set. The purpose of this slightly complicated procedure is so that you don’t “overfit” your model.

Overfitting is what happens when, in your attempt to find a superior model, you build one that is too tailored to your data; when you apply it to a different data set, it can fail spectacularly. By setting aside some of your data as a “test” data set, you make sure that the model you built is not too closely calibrated to the data you used to build it.

Now, there are several ways in which you can divide your data into “calibration” and “test” data sets. One method is to use a random number generator, and randomly divide the data into two parts – typically the calibration data set is about three times as big as the test data set (this is the rule I normally use, but there is no sanctity to this). The problem with this method, however, is that if you are building a model based on data collected at different points in time, any systematic change in behaviour over time cannot be captured by the model, and it loses predictive value. Let me explain.
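To make this concrete, a random split along these lines might look something like the Python sketch below. The DataFrame df, the 0.25 test fraction and the fixed seed are just assumptions for illustration, and there is nothing sacred about them:

```python
import numpy as np
import pandas as pd

def random_split(df, test_fraction=0.25, seed=42):
    """Randomly assign rows to a calibration set (~75%) and a test set (~25%)."""
    rng = np.random.default_rng(seed)
    is_test = rng.random(len(df)) < test_fraction
    return df[~is_test], df[is_test]

# calibration, test = random_split(df)   # df is whatever data you have collected
```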

Let us say that we are collecting some data over time. What data it is doesn’t matter, but essentially we are trying to use a set of variables to predict the value of another variable. Let us say that the relationship between the predictor variables and the predicted variable changes over time.
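As a toy illustration of such data (the slopes of 2 and 5, the sample size and the noise level below are all made up for the example), here is a simulation where the relationship between the predictor x and the predicted variable y changes halfway through the collection period:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
t = np.arange(n)                          # index of when each observation was collected
x = rng.normal(size=n)                    # predictor variable
slope = np.where(t < n // 2, 2.0, 5.0)    # the relationship changes halfway through
y = slope * x + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"t": t, "x": x, "y": y})
```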

Now, if we were to build a model where we randomly divide the data into calibration and test sets, the model we build will be something that takes into account the different regimes. The relationship between the predictor and predicted variables in the calibration data set is likely to be very similar to the relationship between the predictor and predicted variables in the test data set – since both have been sampled uniformly across time. While that might look good, the problem is that this kind of a model has little predictive value.

Another way of splitting your data into calibration and test sets is to split it over time. Rather than using a random number generator to split data into calibration and test parts, we simply use time. We can say that the data collected in the first 3/4th of the time period (in which we’ve collected the data) forms the calibration set, and the last 1/4th forms the test set. A model tested on this kind of calibration and test data is a stronger model, for it has predictive value!
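Continuing with the same hypothetical DataFrame, a time-based split of this kind might look like the sketch below. The 3/4 cut-off and the column name "t" are again just illustrative:

```python
def time_split(df, time_col="t", calibration_fraction=0.75):
    """Use the first ~75% of the time period for calibration and the rest for testing."""
    cutoff = df[time_col].quantile(calibration_fraction)
    return df[df[time_col] <= cutoff], df[df[time_col] > cutoff]

# calibration, test = time_split(df)
```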

In real life, if you have to predict a variable in the future, all you have at your disposal is a model that is calibrated on past data. Thus, you need a model that works across time. And in order to make sure your model can work across time, what you need to do is to split your data into calibration and test sets across time – that way you can check that a model built with data from one time period can indeed work on data from a following time period!

Finally, how can you check if there is a “regime change” in the relationship between the predictor and predicted variables? We can use the difference between these two ways of splitting data into calibration and test sets!

First, split the data into calibration and test sets randomly, build the model on the calibration set, and find out how well it explains the data in the test set. Next, split the data into calibration and test sets by time, and again find out how well the model explains the data in the test set. If there is not much difference in the performance of the model on the test set in these two cases, it means that there is no “regime change”. If there is a significant difference between the performance of the two models, it strongly suggests a regime change. Moreover, the extent of regime change can be evaluated based on the difference in goodness of fit in the two cases.
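Putting this together with the sketches above (the simulated df, random_split and time_split), the comparison might look something like this. The use of scikit-learn's LinearRegression and of R² as the goodness-of-fit measure is my own assumption for the example:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def test_r2(calibration, test):
    """Fit a linear model on the calibration set and score it on the test set."""
    model = LinearRegression().fit(calibration[["x"]], calibration["y"])
    return r2_score(test["y"], model.predict(test[["x"]]))

r2_random = test_r2(*random_split(df))   # calibration and test sampled across time
r2_time = test_r2(*time_split(df))       # calibration strictly before test in time

# Similar values suggest the relationship is stable over time; a large gap suggests
# a regime change, and its size hints at the extent of the change.
print(f"R^2, random split: {r2_random:.3f};  R^2, time split: {r2_time:.3f}")
```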
