Data Science is a Creative Profession

About a month or so back I had a long telephonic conversation with this guy who runs an offshored analytics/data science company in Bangalore. Like most other companies that are being built in the field of analytics, this follows the software services model – a large team in an offshored location, providing long-term standardised data science solutions to a client in a different “geography”.

As is usual with conversations like this one, we talked about our respective areas of work and the kinds of projects we take on, and soon we got to the usual bit in such conversations where we try to “find synergies”. Things were going swimmingly when this guy remarked that it was the first time he was coming across a freelancer in this profession. “I’ve heard of freelance designers and writers, but never freelance data scientists or analytics professionals”, he mentioned.

In a separate conversation, I was talking to one old friend about another old friend who has set up a one-man company to provide what are basically freelance consulting services. We reasoned that this guy had set up a company rather than calling himself a freelancer because of the reputation that “freelancers” (irrespective of the work they do) have – if you say you are a freelancer, people think of someone smoking pot and working on a Mac in a coffee shop. If you say you are a partner or founder of a company, people imagine someone more corporate.

Now that the digression is out of the way, let us get back to my conversation with the guy who runs the offshored shop. During the conversation I didn’t say much beyond things like “what is wrong with being a freelancer in this profession?”. But now that I think about it more, the answer is simply that this is a fundamentally creative profession.

For a large number of people, data science is simply about statistics, or “machine learning”, or predictive modelling – it is about being given a problem expressed in statistical terms and finding the best possible model and model parameters for it. It is about being given a statistical problem and finding a statistical solution. I’m not saying, of course, that statistical modelling involves no creativity – there is a fair bit of it in figuring out what kind of model to build and picking the right model for the right data. But when you have a large team working on the problem, effectively like an assembly line (with different people handling different parts of the solution), what you get is effectively an “assembly line solution”.

Coming back, let us look at this “a day in the life” post I wrote about a year back, about a particular day in office for me. I detailed there the various kinds of problems I had to solve that day – from Hidden Markov Models and Bayesian probability to writing code using dynamic programming, implementing it in R, and then translating the solution back to the business context. Notice that when I started off working on the problem it was not known what domain the problem belonged to – it took some poking and prodding around to figure out the nature of the problem and the first step of the solution.

From then on, it was one step leading to another, and there are two important things to note about each step. Firstly, at each step it wasn’t clear what the best class of technique was to get past it – figuring that out required exploration. Secondly, at no point was it known what the next step was going to be until the current step was solved. You can see that this is hard to do in an assembly line fashion!

Now, you could say it is like a game of chess, where you aren’t sure what the opponent will do. But in chess the opponent is a rational human being, while here the “opponent” is the data and the patterns it shows, and there is no way to know how the data will respond until you try something. So it is impossible to list out all the steps beforehand and solve it – the solution is an exploratory process.

And since solving a “data science problem” (as I define it, of course) is an exploratory, and thus creative, process, it is important to work in an atmosphere that fosters creativity and “thinking without thinking” (basically keeping a problem at the back of your mind, then taking your mind off it and letting the distraction solve it). This is best done away from a traditional corporate environment – where you have to attend meetings and are liable to be disturbed by colleagues at all times – and this is why a freelance model is actually ideal! A small partnership also works – while you might find it hard to “assembly line” the problem, having someone to bounce thoughts and ideas off can have a positive impact on the creative process. Anything more like a corporate structure and you are removing the conditions necessary to foster creativity, and are more likely to come up with cookie-cutter solutions.

So unless your business model involves doing repeatable and continuous analytical work for a client, if you want to solve problems using data science you are better off organising yourselves in an environment that fosters creativity rather than a traditional office kind of structure. Then again, your mileage might vary!

Datapukes and Dashboards

Avinash Kaushik has put out an excellent, if long, blog post on building dashboards. A key point he makes is about the difference between dashboards and what he calls “datapukes” (the name is quite self-explanatory and graphic – it basically refers to a report with a lot of data and little insight). He goes on in the blog post to explain how dashboards need to be tailored to recipients at different levels in the organisation, and the common mistake of building a one-size-fits-all dashboard (which is most likely to end up as a datapuke).

Kaushik explains that the higher up you go in an organisation’s hierarchy, the less access to data managers have, and the less time they have to look into and digest data before coming to a decision – they want the first level of interpretation done for them so that they can proceed to action. In this context, Kaushik explains that dashboards for top management should be “action-oriented”, in that they clearly show the way forward. Such dashboards need to be annotated, he says, with reasoning as to why the numbers look the way they do and what the company needs to do about it.

Going by Kaushik’s blog post, a dashboard is something that definitely requires human input – it requires an intelligent human to look at the data, analyse why it looks the way it does, intelligently figure out how the top management is likely to use it, and prepare the dashboard accordingly.

Now, notice how this requirement of an intelligent human in preparing each dashboard conflicts with the dashboard solutions that a lot of so-called analytics or BI (for Business Intelligence) companies offer – which are basically automated reports with multiple tabs which the manager has to navigate in order to find useful information – in other words, they are datapukes!

Let us take a small digression – when you are at a business lunch, what kind of lunch do you prefer? Given three choices – a la carte, buffet and set menu – there is reason to prefer the set menu over the other two (assuming the kind of food across the three is broadly the same): at a business lunch you want to maximise the time you spend talking and doing business. Given that the lunch is incidental, it is best if you don’t waste any time or energy getting it (or ordering it)!

It is a similar case with dashboards for top management. While a datapuke might give a much broader view, and give the manager the opportunity to drill down, such luxuries are usually not necessary for a time-starved CXO – all he wants are the distilled insights with a view towards what needs to be done. It is very unlikely that such a person will have the time or inclination to drill down – which can anyway be made possible via an attached datapuke.

It will be interesting to see what happens to the BI and dashboarding industry once more companies figure out that what they want are insightful dashboards and not mere datapukes. Given the requirement of an intelligent human (essentially a business analyst) to make these “real” dashboards, will BI companies respond by putting dedicated analysts on each of their clients? Or will we see a new layer of service providers (who might call themselves “management consultants”) who take in the datapukes and use their human intelligence to produce proper dashboards? Or will we find artificial intelligence building the dashboards?

It will be very interesting to watch this space!


The most unique single malt

There might have been a time in your life when you had some single malt whisky and thought that it “doesn’t taste like any other”. In fact, you might have noticed that some single malt whiskies are more distinct than others. You might want to go on a quest to find the most unique single malts, but given that single malts are expensive and not easily available, some data analysis might help.

There is this dataset of 86 single malts that has been floating about the interwebs for a while now, and there is some simple yet interesting analysis related to that data – for example, check out this simple analysis with a K-means clustering of various single malts. They use the dataset (which scores each of the 86 malts on 12 different axes) to cluster the malts, and analyse which whiskies belong to similar groups.
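For a flavour of how such a clustering might be done, here is a minimal sketch – the file name whiskies.csv, the column names and the choice of four clusters are my assumptions for illustration, not necessarily what the actual dataset or analysis uses:

```python
# Sketch: K-means clustering of single malts on their flavour scores.
# Assumes a hypothetical CSV with a 'Distillery' column and 12 numeric
# flavour columns (Body, Sweetness, Smoky, ...).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

malts = pd.read_csv("whiskies.csv")              # hypothetical file name
flavours = malts.drop(columns=["Distillery"])    # keep only the 12 flavour scores

X = StandardScaler().fit_transform(flavours)     # put all axes on the same scale
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

malts["cluster"] = kmeans.labels_
print(malts.groupby("cluster")["Distillery"].apply(list))
```

To hunt for the “most unique” malts, one could go a step further and look at the malts that sit farthest from their cluster centre (or from every other malt) in this 12-dimensional flavour space.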


The Significance of Statistical Significance

Last year, an aunt was diagnosed with extremely low bone density. She had been complaining of back pain and weakness, and a few tests later, her orthopedist confirmed that bone density was the problem. She was put on a course of medication, and then given shots. A year later, she got her bone density tested again, and found that there was not much improvement.

She did the rounds of the doctors again – orthopedists, endocrinologists and the like – and the first few were puzzled that the medication and the shots had had no effect. One of the doctors, though, saw something the others didn’t – “there is no marked improvement, for sure”, he remarked, “but there is definitely some improvement”.

Let us say you take ten thousand observations in “state A”, and another ten thousand in “state B”. The average of your observations in state A is 100, and the standard deviation is 10. The average of your observations in state B is 101, and the standard deviation is 10. Is there a significant difference between the observations in the two states?

Statistically speaking, there most definitely is. With 10,000 samples and a standard deviation of 10, the “standard error” of each mean is 0.1 (10 / sqrt(10000)), so the standard error of the difference between the two means is about 0.14, and the two means are roughly seven standard errors apart – the difference between them is “statistically significant” to a very high degree. The question, however, is whether the difference is actually “significant” (in the non-statistical sense of the word).

Think about it from the context of drug testing. Let us say that we are testing a drug for increasing bone density among people with low bone density (like my aunt). Let’s say we catch 10000 mice and measure their bone densities. Let’s say the average is 100, with a standard deviation of 10.

Now, let us inject our drug (in the appropriate dosage – scaled down from man to mouse) into our mice, and after they’ve undergone the requisite treatment, measure their bone densities again. Let’s say that the average is now 101, with a standard deviation of 10. Based on this test, can we conclude that our drug is effective for improving bone density?
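To make the arithmetic concrete, here is a quick simulation of such an experiment – a sketch, with data simply drawn from normal distributions using the means and standard deviation assumed above:

```python
# Sketch: a tiny effect that is nevertheless statistically significant.
# Simulated bone densities with the assumed means (100 vs 101),
# standard deviation 10, and 10,000 mice in each group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=100, scale=10, size=10_000)
after = rng.normal(loc=101, scale=10, size=10_000)

t_stat, p_value = stats.ttest_ind(after, before)
cohens_d = (after.mean() - before.mean()) / np.sqrt((after.var() + before.var()) / 2)

print(f"t = {t_stat:.1f}, p = {p_value:.1e}")   # p is vanishingly small
print(f"Cohen's d = {cohens_d:.2f}")            # effect size of about 0.1: tiny
```

The p-value says the difference is almost certainly real; the effect size says it is small – which is exactly the distinction the rest of this post is about.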

What cannot be denied is that one course of medication among mice produces results that are statistically significant – there is an increase in bone density among the mice that cannot be explained by randomness alone. From this perspective, the drug undoubtedly works – it is extremely likely that it has some positive effect.

However, does this mean that we use this drug for treating low bone density? Despite the statistical significance, the answer is not very clear. Let us for a moment assume that there are no competitors – there is no other known drug which can increase a patient’s bone density by a statistically significant amount. So the choice is this – we either use no drug, leading to no improvement in the patient (let us assume another experiment has shown that, in the absence of medication, there is no change in bone density), or we use this drug, which produces a small but statistically significant improvement. What do we do?

The question we need to answer here is whether the magnitude of improvement on account of taking this drug is worth the cost (monetary cost, possible side effects, etc.) of taking the drug. Do we want to put the patient through the trouble of taking the medication when we know that the difference it will make, though statistically significant, is marginal? It is a fuzzy question, and doesn’t necessarily have a clear answer.

In summary, the basic point is that a statistically significant improvement does not mean that the difference is significant in terms of magnitude. With samples large enough, even small changes can be statistically significant, and we need to be cognizant of that.

Postscript
No mice were harmed in the course of writing this blog post

Exponential need not mean explosive

Earlier on this blog I’ve written about the misuse of the term “exponential” when it is used to describe an explosive increase in a particular number. My suspicion is that this misuse of the word comes from Computer Science and complexity theory – where the hardest problems to crack are those which require time or space that is exponential in the size of the input. In fact, the definitions of P, NP and NP-completeness revolve around the distinction between problems that can be solved with resources that are a polynomial function of the size of the input and those that, as far as we know, require exponential resources.

Earlier today, I shared this blog post by Bryan Caplan on Puerto Rican immigration into the United States with a comment “exponential immigration”. I won’t rule out drawing some flak for this particular description, for Caplan’s thesis is that Puerto Rican immigration took a long time indeed to “explode”. However, I would expect that the flak I get for describing this variable as “exponential” would come from people who mistake “exponential” for “explosive”.

Caplan’s observation in the above linked blog post is that immigration from Puerto Rico to the United States was extremely slow for a very long time. It was in the early 1900s that a US Supreme Court ruling gave Puerto Ricans free access to the United States, yet it took close to a hundred years for this immigration to “explode”. Caplan’s theory is that the number of people moving to the US per year is a function of the number of Puerto Ricans who are already there!

In other words, the immigration process can be described by our favourite equation: dX/dt = kX, solving which we get an equation of the form X = X_0 exp(kt), which means that the growth is indeed exponential in time! Yet, given a rather small value of X_0 (the number of Puerto Ricans in the United States at the time of the ruling) and a small value of k, the increase has been anything but explosive, despite being exponential.
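To see how unexplosive exponential growth can be, here is a toy calculation – X_0 and k below are made-up illustrative values, not actual immigration figures:

```python
# Sketch: exponential growth with a small starting value and a small rate.
# X0 and k are hypothetical numbers, purely for illustration.
import math

X0 = 2000     # hypothetical initial population
k = 0.04      # hypothetical growth rate of 4% per year

for years in (10, 25, 50, 100):
    X = X0 * math.exp(k * years)
    print(f"after {years:3d} years: {X:9.0f}")
```

The process is exponential throughout, but it takes the better part of a century before the numbers look like anything worth calling an explosion.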

The point of this post is worth reiterating: the word “exponential”, in its common use, has been taken to be synonymous with “explosive”, and this is wrong. Exponential growth need not be explosive, and explosive growth need not be exponential! The two concepts are unrelated and people would do well to not confuse one with the other.


Standard deviation is over

I first learnt about the concept of Standard Deviation sometime in 1999, when we were being taught introductory statistics in class 12. It was classified under the topic of “measures of dispersion”, and after having learnt the concepts of “mean deviation from median” (and learning that “mean deviation from mean” is identically zero) and “mean absolute deviation”, the teacher slipped in the concept of the standard deviation.

I remember being taught the mnemonic of “railway mail service” to remember that the standard deviation was “root mean square” (RMS! get it?). Calculating the standard deviation was simple. You took the difference between each data point and the average, and then it was “root mean square” – you squared the differences, took their arithmetic mean and then the square root.
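In code, that recipe is just a few lines (a sketch of the population version, i.e. dividing by n rather than n-1):

```python
# Standard deviation as the "root mean square" of deviations from the mean
# (population version: divide by n, not n-1).
import math

data = [4, 8, 6, 5, 3, 7]                      # any set of numbers
mean = sum(data) / len(data)
deviations = [x - mean for x in data]
std_dev = math.sqrt(sum(d * d for d in deviations) / len(data))

print(round(std_dev, 3))
```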

Back then, nobody bothered to tell us why the standard deviation was significant. Later, in engineering, someone (wrongly) told us that you square the deviations so that you can account for negative numbers (if that were true, the MAD would be equally serviceable). A few years later, learning statistics at business school, we were told (rightly this time) that the standard deviation was significant because it penalized outliers more heavily, by squaring them. A few days later we learnt hypothesis testing, which used the bell curve. “The mean plus or minus two standard deviations covers 95% of the data”, we learnt, and blindly applied it to all data sets – the problems we encountered in examinations only dealt with data sets that were actually normally distributed. It was much later that we figured that the number six in “six sigma” was literally pulled out of thin air, as a dedication to Sigma Six, a precursor of Pink Floyd.

Somewhere along the way, we learnt that the specialty of the normal distribution is that it can be uniquely described by its mean and standard deviation. One look at the formula for its PDF tells you why: f(x) = 1/(sigma * sqrt(2 pi)) * exp(-(x - mu)^2 / (2 sigma^2)) – the only parameters in it are mu (the mean) and sigma (the standard deviation).

Most introductory stats lessons are taught from the point of view of using stats to do science. In the natural world, and in science, a lot of things are normally distributed (hence it is the “normal” distribution). Thus, learning statistics using the normal distribution as a framework is helpful if you seek to use it to do science. The problem arises, however, when you assume that everything is normally distributed, as a lot of people who learn their statistics through the normal distribution end up doing.

When you step outside the realm of natural science, however, you are in trouble if you blindly use the standard deviation, and consequently the normal distribution. For in such realms, data is seldom normally distributed. Take, for example, stock markets. Most popular financial models assume that stock price movements are either normal or log-normal (the famous Black-Scholes equation uses the latter assumption). In certain regimes these might be reasonable assumptions, but pretty much anyone who has followed the markets for a while knows that stock price movements have “fat tails”, and thus the lognormal assumption is not a great one.

At least the stock price movement looks somewhat normal (apart from the fat tails). What if you are doing some social science research and are looking at, for example, data on people’s incomes? Do you think it makes sense at all to define standard deviation for income of a sample of people? Going further, do you think it makes sense at all to compare the dispersion in incomes across two populations by measuring the standard deviations of incomes in each?
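A quick simulation shows the problem – a sketch, with the “incomes” drawn from a lognormal distribution as a crude but convenient stand-in for a skewed income distribution:

```python
# Sketch: the standard deviation on skewed, fat-tailed "income" data.
import numpy as np

rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=10, sigma=1.0, size=100_000)

print(f"mean    : {incomes.mean():10.0f}")
print(f"median  : {np.median(incomes):10.0f}")
print(f"std dev : {incomes.std():10.0f}")   # larger than the mean itself
print(f"IQR     : {np.percentile(incomes, 75) - np.percentile(incomes, 25):10.0f}")
```

The standard deviation here comes out larger than the mean and is driven largely by the top few earners; a percentile-based measure such as the interquartile range says much more about the spread of the bulk of the population.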

I was once talking to an organization which was trying to measure and influence salesperson efficiency. In order to do this, again, they were looking at mean and standard deviation. Given that the sales of one salesperson can be an order of magnitude greater than that of another (given the nature of their product), this made absolutely no sense!

The emphasis on the standard deviation in our education means that most people know only one way to measure dispersion. And when you know only one method to measure something, you are likely to apply it irrespective of whether it is appropriate to the circumstances. It leads to the proverbial hammer-and-nail problem.

What we need to understand is that the standard deviation makes sense only for some kinds of data. Yes, it is mathematically defined for any set of numbers, but it makes physical sense only when the data is approximately normally distributed. When data doesn’t fit such a distribution (and more often than not it doesn’t), the standard deviation makes little sense!

For those that noticed, the title of this post is a dedication to Tyler Cowen’s recent book.

Calibration and test sets

When you’re doing any statistical analysis, the standard thing to do is to divide your data into “calibration” and “test” data sets. You build the model on the “calibration” data set, and then test it on the “test” data set. The purpose of this slightly complicated procedure is so that you don’t “overfit” your model.

Overfitting is when, in your attempt to find a superior model, you build one that is too closely tailored to your data, so that when you apply it to a different data set it can fail spectacularly. By setting aside some of your data as a “test” data set, you make sure that the model you built is not too closely fitted to the data you used to calibrate it.

Now, there are several ways in which you can divide your data into “calibration” and “test” data sets. One method is to use a random number generator and randomly divide the data into two parts – typically the calibration data set is about three times as big as the test data set (this is the rule I normally use, but there is no sanctity to it). The problem with this method, however, is that if you are building a model based on data collected at different points in time, any systematic change in behaviour over time is not tested for, and the model can lose predictive value. Let me explain.

Let us say that we are collecting some data over time. What data it is doesn’t matter, but essentially we are trying to use a set of variables to predict the value of another variable. Let us say that the relationship between the predictor variables and the predicted variable changes over time.

Now, if we were to build a model where we randomly divide the data into calibration and test sets, the model we build will be something that takes into account the different regimes. The relationship between the predictor and predicted variables in the calibration data set is likely to be identical to the relationship in the test data set – since both have been sampled uniformly across time. While that might make the model look good on the test set, the problem is that this kind of model has little predictive value.

Another way of splitting your data into calibration and test sets is to split it over time. Rather than using a random number generator, we simply use time: the data collected in the first 3/4th of the time period forms the calibration set, and the last 1/4th forms the test set. A model tested on this kind of calibration and test data is a stronger model, for it has been shown to have predictive value!

In real life, if you have to predict a variable in the future, all you have at your disposal is a model calibrated on past data. Thus, you need a model that works across time. And in order to make sure your model can work across time, you need to split your data into calibration and test sets across time – that way you can check that a model built with data from one time period can indeed work on data from a following time period!

Finally, how can you check if there is a “regime change” in the relationship between the predictor and predicted variables? We can use the difference between these two ways of splitting the data into calibration and test sets!

Firstly, split the data into calibration and test sets randomly, build the model on the calibration set, and find out how well it explains the data in the test set. Next, split the data into calibration and test sets by time and do the same. If there is not much difference in the performance on the test set in the two cases, there is no “regime change”. If there is a significant difference, there is a definite regime change, and its extent can be evaluated from the difference in goodness of fit in the two cases.
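Here is a minimal sketch of that check, under assumptions: the data is simulated (with the relationship deliberately changing halfway through the period), and a plain linear regression stands in for whatever model you actually use:

```python
# Sketch: compare a random split with a time-based split to detect a
# "regime change". Simulated data; the X -> y relationship deliberately
# changes halfway through the time period.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 2))
coef = np.where(np.arange(n)[:, None] < n // 2, [2.0, -1.0], [0.5, 1.5])
y = (X * coef).sum(axis=1) + rng.normal(scale=0.5, size=n)

# 1. Random calibration/test split (75/25)
Xc, Xt, yc, yt = train_test_split(X, y, test_size=0.25, random_state=0)
r2_random = r2_score(yt, LinearRegression().fit(Xc, yc).predict(Xt))

# 2. Time-based split: first 3/4 of the period calibrates, last 1/4 tests
cut = int(0.75 * n)
r2_time = r2_score(y[cut:], LinearRegression().fit(X[:cut], y[:cut]).predict(X[cut:]))

print(f"R^2 with random split    : {r2_random:.2f}")
print(f"R^2 with time-based split: {r2_time:.2f}")   # much lower => regime change
```

With no regime change in the data, the two numbers would be close; the bigger the gap, the bigger the change in the underlying relationship.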

Analytics and complexity

I recently learnt that a number of people think that the more variables you use in your model, the better the model is! What has surprised me is how many such people I have met, and how recommendations for simple models haven’t been taken too kindly.

The conversation usually goes like this:

“so what variables have you considered for your analysis of ______ ?”
“A,B,C”
“Why don’t you consider D,E,F,… X,Y,Z also? These variables matter for these reasons. You should keep all of them and build a more complete model”
“Well I considered them but they were not significant so my model didn’t pick them up”
“No but I think your model is too simplistic if it uses only three variables”

This is a conversation I’ve had with so many people that I wonder what kind of conceptions people have about analytics. I also wonder if this is because of the difference in the way I communicate compared to other “analytics professionals”.

When you do analytics, there are two ways to communicate – to simplify and to complicate (for lack of a better word). Based on my experience, what I find is that a majority of analytics professionals and modelers prefer to complicate – they talk about complicated statistical techniques they use for solving the problem (usually with fancy names) and bulldoze the counterparty into thinking they are indeed doing something hi-funda.

The other approach, followed by (in my opinion) a smaller number of people, is to simplify. You try and explain your model in simple terms that the counterparty will understand. So if your final model contains only three explanatory variables, you tell them that only three variables are used, and you show how each of these variables (and combinations thereof) contribute to the model. You draw analogies to models the counterparty can appreciate, and use that to explain.

Now, just as analytics professionals can be divided into two kinds (as above), I think consumers of analytics can also be divided into two kinds. There are those who like to understand the model, and those who simply want to get to the insights. The former are better served by the complicating type of analytics professional, and the latter by the simplifying type. The other two combinations lead to disaster.

Like a good management consultant, I represent this problem using the following two-by-two:

[2×2 chart: type of analytics professional (simplifying vs complicating) against type of analytics consumer (wants to understand the model vs wants only the insights)]

As a principle, I like to explain models in a simplified fashion, so that the consumer can completely understand them and use them in a way they see appropriate. The more pragmatic among you, however, can take a guess at what type the consumer is and tweak your communication accordingly.


Black Box Models

A few years ago, Felix Salmon wrote this article in Wired called “The Formula That Killed Wall Street”. It was about the “Gaussian copula”, a formula for estimating the joint probability of a set of events happening, given each event’s individual probability and a measure of the correlation between the events. It was a mathematical breakthrough.
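For a flavour of what such a formula does, here is a sketch of using a Gaussian copula to turn two individual probabilities and an assumed correlation into a joint probability – my own illustration with made-up numbers, not necessarily the exact formula from Li’s paper or Salmon’s article:

```python
# Sketch: joint probability of two events via a Gaussian copula.
# The individual probabilities and the correlation are made-up inputs.
from scipy.stats import multivariate_normal, norm

p_a, p_b = 0.05, 0.04   # individual probabilities of the two events
rho = 0.3               # assumed correlation of the latent Gaussian variables

# Map each probability to a threshold on a standard normal variable, then
# ask how often both latent variables fall below their thresholds together.
z_a, z_b = norm.ppf(p_a), norm.ppf(p_b)
joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]]).cdf([z_a, z_b])

print(f"if independent : {p_a * p_b:.4f}")
print(f"copula estimate: {joint:.4f}")
```

The output is quite sensitive to rho – get the correlation even slightly wrong and the joint probability moves a lot, which is part of why the errors cascaded when large numbers of instruments were tied together.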

Unfortunately, it fell into the hands of quants and traders who didn’t fully understand it, and they used it to derive joint probabilities of a large number of instruments put together. What they did not realize was that there was an error in the model (as there is in all models), and when they used the formula to tie up a large number of instruments, this error cascaded, resulting in an extremely inaccurate model and subsequent massive losses (this paragraph is based on my reading of the situation; your mileage might vary).

In a blog post earlier this week at Reuters, Salmon returned to this article. He said:

 And you can’t take technology further than its natural limits, either. It wasn’t really the Gaussian copula function which killed Wall Street, nor was it the quants who wielded it. Rather, it was the quants’ managers — the people whose limited understanding of copula functions and value-at-risk calculations allowed far too much risk to be pushed out into the tails. On Wall Street, just as in the rest of industry, a little bit of common sense can go a very long way.

I’m completely with him on this one. This blog post was in reference to Salmon’s latest article in Wired, which is about the four stages in which quants disrupt industries. You are encouraged to read both the Wired article and the blog post about it.

The essence is that it is easy to overdo analytics. Once you have a model that works in a few cases, you will end up putting too much faith in the model, soon the model will become gospel, and you will build the rest of the organization around the model (this is Stage Three that Salmon talks about). For example, a friend who is a management consultant once mentioned how bank lending practices are now increasingly formula driven. He mentioned reading a manager’s report that said “I know the applicant well, and am confident that he will repay the loan. However, our scoring system ranks him too low, hence I’m unable to offer the loan”.

The key issue, as Salmon mentions in his blog post, is that managers need to have at least a basic understanding of analytics (I had touched upon this issue in an earlier blog post). As I had written there, there are two ways in which the analytics team can end up not contributing to the firm – firstly, people think they are geeks whom nobody understands, and ignore them. Secondly, and perhaps more dangerously, people think of the analytics guys as gods, and fail to challenge them sufficiently, thus putting too much faith in the models.

From this perspective, it is important for the analytics team to communicate well with the other managers – to explain the basic logic behind the models, so that the managers understand the assumptions and limitations and can use the models in the intended manner. What usually happens, though, is that after a few attempts in which management doesn’t “get” the models, the analytics people resign themselves to using technical jargon and three-letter acronyms to bulldoze their models past the managers.

The point of this post, however, is about black box models. Sometimes, you can have people (either analytics professionals or managers) using models without fully understanding them or their assumptions. This inevitably leads to disaster. A good example is the traders and quants who used David Li’s Gaussian copula and ended up with horribly wrong models.

In order to prevent this, a good practice would be for the analytics people to be able to explain the model in an intuitive fashion (without using jargon) to the managers, so that they all understand the essence and nuances of the model in question. This, of course, means that you need to employ analytics people who are capable of effectively communicating their ideas, and employ managers who are able to at least understand some basic quant.

On finding the right signal

It is not necessary that every problem yields a “signal”. It is quite possible that you try to solve a problem using data and are simply unable to find any signal. This does not mean that you have failed in your quest – the fact that you have found the absence of a signal is itself valuable information and needs to be appreciated.

Sometimes, however, clients and consumers of analytics fail to appreciate this. In their opinion, if you fail to find an answer to a particular problem, you as an analyst have failed in your quest. They think that with a better analyst or better analysis it is possible to get a superior signal.

This failure by consumers of analytics to appreciate that there need not always be a signal can lead to fudging. Let us say you have a data set where there is a very weak signal – say all your explanatory variables together explain about 1% of the variance in the dependent variable. In most cases (unless you are trading – in which case a 1% signal has some value), there is little value to be gleaned from this, and you are better off not applying a model at all. However, the fact that the client may not appreciate a “no” for an answer can lead you to propose this 1%-explanatory model as the truth.
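To see what a “1% signal” looks like, here is a small simulation – a sketch, with the data constructed so that the explanatory variables account for roughly 1% of the variance:

```python
# Sketch: a model whose explanatory variables explain about 1% of the variance.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
n = 10_000
X = rng.normal(size=(n, 3))
y = 0.1 * X[:, 0] + rng.normal(size=n)   # true R^2 is about 0.01

model = LinearRegression().fit(X, y)
pred = model.predict(X)

print(f"R^2                 : {r2_score(y, pred):.3f}")
print(f"std of raw y        : {np.std(y):.3f}")
print(f"std of the residuals: {np.std(y - pred):.3f}")   # barely any smaller
```

The fit is not zero, but the residuals are barely narrower than the raw data – predictions from such a model are hardly better than just predicting the average, and reporting it as a meaningful signal would be fudging.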

What one needs to recognize is that a bad model can sometimes subtract value. One of my clients was once using a model that had been put in place by an earlier consultant. The model prescribed certain criteria they had to follow in recruitment, and I was asked to take a look at it. What I found was that it showed absolutely no “signal” – based on my analysis, people who scored high on that model were no more likely to do well than those who scored low!

You might ask what the problem with such a model is. The problem is that by recommending a certain set of scores on a certain set of parameters, the model was filtering out a large number of candidates without any basis. Thus, using a poor model, the company was trying to recruit out of a much smaller pool, which left the hiring managers with less choice, which in turn led to suboptimal decisions. I remember closing that case with a recommendation to dismantle the model (since it wasn’t giving much of a signal anyway) and to instead simply empower the hiring manager!

Essentially, companies need to recognize two things. Firstly, not having a model is better than having a poor model, for a poor model can subtract value and lead to suboptimal decision-making. Secondly, not every problem has a quantitative solution – it is quite possible that there is absolutely no signal in the data. If no signal exists, the analyst is not at fault for not finding one! In fact, she would be dishonest if she were to report a signal when none existed!

It is important that companies keep these two things in mind while hiring a consultant to solve a problem using data.