Meaningful and meaningless variables (and correlations)

A number of data scientists I know like to go about their business in a domain-free manner. They make a conscious choice to not know anything about the domain in which they are solving the problem, and instead treat a dataset as just a set of anonymised data, and attack it with the usual methods.

I used to be like this as well a long time ago. I remember in my very first job I had pissed off some clients by claiming that “I don’t care if this is a nut or a screw. As far as I’m concerned this is just a part number”.

Over time, though, I’ve come to realise that even a little bit of domain knowledge or intuition can help build significantly superior models. To use a framework I had introduced a few months back, your domain knowledge can be used to restrict the degrees of freedom in your model, thus increasing how much the machine can learn with the available data.

Then again, some problems lend themselves better to domain-based intuition than others, and this has to do with the meaning of a data point.

Consider two fairly popular problem statements from data science – determining whether a borrower will pay back a loan, and determining whether there is a cat in a given picture. While on the surface both are binary decisions, made by looking at high-dimensional data (the number of variables that can be used for credit scoring can be immense), there is an important distinction between the two problems.

In the cat picture case, a single data point is basically the colour of a single pixel in an image, and it doesn’t really mean anything. If we were to try and build a cat recognition algorithm based on a single pre-chosen pixel in an image, it is unlikely we could do better than noise. Instead, the information is encoded in groups of pixels near each other – a bunch of pixels that look like cat ears, for example. In this case, whether you are training the model to identify cats or cinnamon buns is immaterial, and the domain-free approach works well.
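
To make the pixel point concrete, here is a minimal sketch (not from the original post), assuming scikit-learn is available and using its digits dataset as a stand-in for the cat problem; the pixel index, the binary task and the metric are all arbitrary choices for illustration.

```python
# Compare a model that sees only one pre-chosen pixel with one that sees
# all pixels together. The digits dataset is a stand-in for cat/no-cat.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
y = (y == 3).astype(int)  # binary task: "is this a 3?" (stand-in for "is this a cat?")

model = LogisticRegression(max_iter=1000)

# One arbitrary pre-chosen pixel as the only explanatory variable
single_pixel_auc = cross_val_score(model, X[:, [28]], y, cv=5, scoring="roc_auc").mean()

# All pixels together, so the model can exploit correlations between neighbouring pixels
all_pixels_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

print(f"single pixel AUC: {single_pixel_auc:.2f}")
print(f"all pixels AUC:   {all_pixels_auc:.2f}")
```

The single-pixel model should do far worse than the all-pixels model, which is the sense in which an individual pixel carries very little information on its own.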

With the credit scoring problem, the amount of information in each explanatory variable is significant. Unless we are looking at some extremely esoteric or insignificant variables (trust me, these get used fairly often in credit scoring models), it is possible to build a decision model based on just one explanatory variable and still have significant predictive power. There is definitely information in the correlations between explanatory variables, but that pales compared to the information in the variables themselves.
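
A small synthetic sketch (again, not from the original post) of the same idea: one meaningful variable, on its own, carrying genuine predictive power. The variable name, the data and the relationship are all made up for illustration.

```python
# One-variable credit model on synthetic data: a made-up "debt-to-income ratio"
# drives default, and a model with only that variable does much better than chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
debt_to_income = rng.gamma(2.0, 0.2, size=n)                 # hypothetical explanatory variable
p_default = 1 / (1 + np.exp(-(4 * debt_to_income - 2)))      # default risk rises with the ratio
defaulted = rng.binomial(1, p_default)

auc = cross_val_score(LogisticRegression(max_iter=1000),
                      debt_to_income.reshape(-1, 1), defaulted,
                      cv=5, scoring="roc_auc").mean()
print(f"one-variable model ROC AUC: {auc:.2f}")  # well above the 0.5 of random guessing
```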

And the amount of information captured by each explanatory variable means that it makes sense in these cases to invest some human effort in understanding the variables and the impact each of them has. In some cases, you might decide to use a mathematical transformation of a variable (square or log or inverse) instead of the variable itself. In other cases, you might determine based on logic that some correlations are spurious and drop the variables altogether. You might see a few explanatory variables with largely similar information and decide to drop some of them or use dimension reduction algorithms. And you can do a much better job of this if you have some experience or intuition about the domain, and care to understand what each variable means. Because variables have meanings.
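
As a hedged illustration of the kind of variable-level work described above (the column names, the near-duplicate pair and the 0.95 threshold are all assumptions for the example):

```python
# Hypothetical applicant table: transform a skewed variable, and drop one of a
# pair of variables that carry essentially the same information.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
monthly_income = rng.lognormal(mean=10, sigma=0.8, size=n)
df = pd.DataFrame({
    "monthly_income": monthly_income,
    "annual_income": monthly_income * 12 * rng.normal(1.0, 0.02, size=n),  # near-duplicate of the above
    "loan_amount": rng.lognormal(mean=11, sigma=1.0, size=n),
})

# Use a transformation of a heavily skewed variable instead of the raw variable
df["log_loan_amount"] = np.log(df["loan_amount"])

# Two variables with largely the same information: keep one, drop the other
if df["monthly_income"].corr(df["annual_income"]) > 0.95:
    df = df.drop(columns=["annual_income"])

print(df.columns.tolist())
```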

This is unlike the image recognition problem, where most of the information is in the correlations between pixels, because of which the individual “variables” don’t have any meaning, and where domain knowledge doesn’t matter that much (though it can – some kinds of algorithms are superior at some kinds of images; I don’t have much experience in this domain to comment 🙂 ).

Again, as with all the two-by-twos that I produce (and there are many, though this is arguably the most famous one), the problem arises when you take people from one side and put them in a situation from the other side.

If you come from a background where you’ve mostly dealt with datasets where each individual variable is meaningless, but there is information in the collective, you are likely to “stir the pile” rather than use intuition to build better models.

If you are used to dealing with datasets with “meaning”, where the variables themselves hold the information, you might waste time doing your jiggery-pokery when you should be looking to apply models that extract information from the collective.

The problem is that this is a rather esoteric classification, so there is plenty of scope for people to be thrown in at the wrong end.