Data Science is a Creative Profession

About a month or so back I had a long telephonic conversation with this guy who runs an offshored analytics/data science company in Bangalore. Like most other companies that are being built in the field of analytics, this follows the software services model – a large team in an offshored location, providing long-term standardised data science solutions to a client in a different “geography”.

As is usual with conversations like this one, we talked about our respective areas of work and the kinds of projects we take on, and soon we got to the usual bit in such conversations where we were trying to “find synergies”. Things were going swimmingly when this guy remarked that it was the first time he was coming across a freelancer in this profession. “I’ve heard of freelance designers and writers, but never freelance data scientists or analytics professionals”, he mentioned.

In a separate conversation, I was talking to one old friend about another old friend who has set up a one-man company to provide what are basically freelance consulting services. We reasoned that this guy had set up a company rather than calling himself a freelancer because of the reputation that “freelancers” (irrespective of the work they do) have – if you say you are a freelancer, people think of someone smoking pot and working in a coffee shop on a Mac. If you say you are a partner or founder of a company, people imagine someone more corporate.

Now that the digression is out of the way, let us get back to my conversation with the guy who runs the offshored shop. During the conversation I didn’t say much, just saying things like “what is wrong with being a freelancer in this profession?”. But now that I think more about it, it is simply a function of data science being a fundamentally creative profession.

For a large number of people, data science is simply about statistics, or “machine learning”, or predictive modelling – it is about being given a problem expressed in statistical terms and finding the best possible model and model parameters for it. It is about being given a statistical problem and finding a statistical solution. I’m not saying, of course, that statistical modelling is not a creative profession – there is a fair bit of creativity involved in figuring out what kind of model to build, and in picking the right model for the right data. But when you have a large team working on the problem, working effectively like an assembly line (with different people handling different parts of the solution), what you get is effectively an “assembly line solution”.

Coming back, let us look at this “a day in the life” post I wrote about a year back, about a particular day in office for me. I’ve detailed there the various kinds of problems I had to solve that day – from Hidden Markov Models and Bayesian probability to writing code using dynamic programming, implementing it in R, and then translating the solution back to the business context. Notice that when I started off working on the problem it was not known what domain the problem belonged to – it took some poking and prodding around to figure out the nature of the problem and the first step of the solution.

And then on, it was one step leading to another, and there are two important facts to consider about each step. Firstly, at each step it wasn’t clear what the best class of technique to get past that step was – it took exploration to figure it out. Secondly, at no point was it known what the next step was going to be until the current step was solved. You can see that it is hard to do this in an assembly line fashion!

Now, you can argue that it is like a game of chess where you aren’t sure what the opponent will do, but in chess the opponent is a rational human being, while here the “opponent” is basically the data and the patterns it shows, and there is no way to know how the data will react to something until you try it. So it is impossible to list out all the steps beforehand and solve the problem – the solution is an exploratory process.

And since solving a “data science problem” (as I define it, of course) is an exploratory, and thus creative, process, it is important to work in an atmosphere that fosters creativity and “thinking without thinking” (basically keeping a problem in the back of your mind, taking your mind off it, and letting the distraction solve it for you). This is best done away from a traditional corporate environment – where you have to attend meetings and are liable to be disturbed by colleagues at all times – and this is why a freelance model is actually ideal! A small partnership also works – while you might find it hard to “assembly line” the problem, having someone to bounce thoughts and ideas off can have a positive impact on the creative process. Anything more like a corporate structure and you are removing the conditions necessary to foster creativity, and in such situations you are more likely to come up with cookie-cutter solutions.

So unless your business model deals with doing repeatable and continuous analytical work for a client, if you want to solve problems using data science you are better off organising yourselves in an environment that fosters creativity rather than a traditional office kind of structure. Then again, your mileage might vary!

Datapukes and Dashboards

Avinash Kaushik has put out an excellent, if long, blog post on building dashboards. A key point he makes is about the difference between dashboards and what he calls “datapukes” (while the name is quite self-explanatory and graphic, it basically refers to a report with a lot of data and little insight). He goes on in the blog post to explain how dashboards need to be tailored for recipients at different levels in the organisation, and the common mistake people make of building a one-size-fits-all dashboard (which is most likely to end up being a datapuke).

Kaushik explains that the higher up you go in an organisation’s hierarchy, the less access to data managers have, and the less time they have to look into and digest data before they come to a decision – they want the first level of interpretation to have been done for them so that they can proceed to the action. In this context, Kaushik explains that dashboards for top management should be “action-oriented” in that they clearly show the way forward. Such dashboards need to be annotated, he says, with reasoning provided as to why the numbers are the way they are, and what the company needs to do about it.

Going by Kaushik’s blog post, a dashboard is something that definitely requires human input – it requires an intelligent human to look at the data, analyse the reasons it looks the way it does, intelligently figure out how the top management is likely to use it, and then prepare the dashboard accordingly.

Now, notice how this requirement of an intelligent human in preparing each dashboard conflicts with the dashboard solutions that a lot of so-called analytics or BI (for Business Intelligence) companies offer – which are basically automated reports with multiple tabs which the manager has to navigate in order to find useful information – in other words, they are datapukes!

Let us take a small digression – when you are at a business lunch, what kind of lunch do you prefer? Given three choices – a la carte, buffet and set menu, which one would you prefer? Assuming the kind of food across the three is broadly the same, there is reason to prefer a set menu over the other two options – at a business lunch you want to maximise the time you spend talking and doing business. Given that the lunch is incidental, it is best if you don’t waste any time or energy getting it (or ordering it)!

It is a similar case with dashboards for top management. While a datapuke might give a much broader view, and give the manager the opportunity to drill down, such luxuries are usually not necessary for a time-starved CXO – all he wants are the distilled insights with a view towards what needs to be done. It is very unlikely that such a person will have the time or inclination to drill down – which can anyway be made possible via an attached datapuke.

It will be interesting to see what happens to the BI and dashboarding industry once more companies figure out that what they want are insightful dashboards and not mere datapukes. With an intelligent human (essentially a business analyst) required to make these “real” dashboards, will the BI companies respond by putting dedicated analysts on each of their clients? Or will we see a new layer of service providers (who might call themselves “management consultants”) who take in the datapukes and use their human intelligence to produce proper dashboards? Or will we find artificial intelligence building the dashboards?

It will be very interesting to watch this space!

 

Analytics and complexity

I recently learnt that a number of people think that the more variables you use in your model, the better your model is! What has surprised me is how many such people I’ve met, and how unkindly my recommendations for simple models have been taken.

The conversation usually goes like this:

“so what variables have you considered for your analysis of ______ ?”
“A,B,C”
“Why don’t you consider D,E,F,… X,Y,Z also? These variables matter for these reasons. You should keep all of them and build a more complete model”
“Well I considered them but they were not significant so my model didn’t pick them up”
“No but I think your model is too simplistic if it uses only three variables”
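
For what it’s worth, the “not significant” answer in that exchange usually comes out of a check like the one below – a minimal sketch in Python, on made-up data with made-up variable names, not any client’s actual analysis:

```python
# Minimal sketch: fit a model with all candidate variables and look at
# which ones come out statistically significant. Data and names are made up.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({c: rng.normal(size=n) for c in list("ABCDEF")})
# Only A, B and C actually drive the outcome in this toy data set
df["y"] = 2 * df["A"] - 1.5 * df["B"] + 0.8 * df["C"] + rng.normal(size=n)

X = sm.add_constant(df[list("ABCDEF")])
model = sm.OLS(df["y"], X).fit()
print(model.pvalues.round(3))

# D, E and F show large p-values: "I considered them, but they were not
# significant, so the model didn't pick them up."
pvals = model.pvalues.drop("const")
print("Variables worth keeping:", list(pvals[pvals < 0.05].index))
```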

This is a conversation I’ve had with so many people that I wonder what kind of conceptions people have about analytics. Now I wonder if this is because of the difference in the way I communicate compared to other “analytics professionals”.

When you do analytics, there are two ways to communicate – to simplify and to complicate (for lack of a better word). Based on my experience, what I find is that a majority of analytics professionals and modelers prefer to complicate – they talk about complicated statistical techniques they use for solving the problem (usually with fancy names) and bulldoze the counterparty into thinking they are indeed doing something hi-funda.

The other approach, followed by (in my opinion) a smaller number of people, is to simplify. You try and explain your model in simple terms that the counterparty will understand. So if your final model contains only three explanatory variables, you tell them that only three variables are used, and you show how each of these variables (and combinations thereof) contribute to the model. You draw analogies to models the counterparty can appreciate, and use that to explain.

Now, just as analytics professionals can be divided into two kinds (as above), I think consumers of analytics can also be divided into two kinds. There are those that like to understand the model, and those that simply want to get to the insights. The former are better served by the complicating type of analytics professional, and the latter by the simplifying type. The other two combinations lead to disaster.

Like a good management consultant, I represent this problem using the following two-by-two:

[Figure: a two-by-two of analytics professionals (simplifying vs. complicating) against consumers of analytics (those who want to understand the model vs. those who just want the insights)]

As a principle, I like to explain models in a simplified fashion, so that the consumer can completely understand them and use them in a way he sees appropriate. The more pragmatic among you, however, can take a guess at what type the consumer is and tweak your communication accordingly.

 

 

Black Box Models

A few years ago, Felix Salmon wrote an article in Wired called “The Formula That Killed Wall Street”. It was about the “Gaussian copula”, a formula for estimating the joint probability of a set of events happening, if you know the individual probabilities. It was a mathematical breakthrough.

Unfortunately, it fell into the hands of quants and traders who didn’t fully understand it, and they used it to derive joint probabilities of a large number of instruments put together. What they did not realize was that there was an error in the model (as there is in all models), and when they used the formula to tie up a large number of instruments, this error cascaded, resulting in an extremely inaccurate model and subsequent massive losses (this paragraph is based on my reading of the situation. Your mileage might vary).
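
For the curious, here is a minimal sketch (my own illustration in Python, not anything from Salmon’s article or Li’s paper) of what the Gaussian copula does – stitch individual probabilities into a joint probability via an assumed correlation – and of how sensitive that joint probability is to the correlation you assume, which is where the cascading error comes from:

```python
# Minimal sketch of a (bivariate) Gaussian copula: combine two marginal
# probabilities into a joint probability, given an assumed correlation.
from scipy.stats import norm, multivariate_normal

def joint_prob(p1, p2, rho):
    """P(both events happen) under a Gaussian copula with correlation rho."""
    z = [norm.ppf(p1), norm.ppf(p2)]          # map marginals to normal space
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal(mean=[0, 0], cov=cov).cdf(z)

p1 = p2 = 0.02   # say, a 2% individual default probability each
for rho in (0.0, 0.1, 0.3, 0.6):
    print(f"rho = {rho:.1f}: joint probability = {joint_prob(p1, p2, rho):.5f}")

# Independence gives 0.02 * 0.02 = 0.0004; a modest change in the assumed
# correlation multiplies the joint probability several times over. Tie
# hundreds of instruments together and a small error in rho cascades.
```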

In a blog post earlier this week at Reuters, Salmon returned to this article. He said:

 And you can’t take technology further than its natural limits, either. It wasn’t really the Gaussian copula function which killed Wall Street, nor was it the quants who wielded it. Rather, it was the quants’ managers — the people whose limited understanding of copula functions and value-at-risk calculations allowed far too much risk to be pushed out into the tails. On Wall Street, just as in the rest of industry, a little bit of common sense can go a very long way.

I’m completely with him on this one. This blog post was in reference to Salmon’s latest article in Wired, which is about the four stages in which quants disrupt industries. You are encouraged to read both the Wired article and the blog post about it.

The essence is that it is easy to over-do analytics. Once you have a model that works in a few cases, you will end up putting too much faith in the model, and soon the model will become gospel, and you will build the rest of the organization around the model (this is Stage Three that Salmon talks about). For example, a friend who is a management consultant once mentioned how bank lending practices are now increasingly formula driven. He mentioned reading a manager’s report that said “I know the applicant well, and am confident that he will repay the loan. However, our scoring system ranks him too low, hence I’m unable to offer the loan”.

The key issue, as Salmon mentions in his blog post, is that managers need to have at least a basic understanding of analytics (I had touched upon this issue in an earlier blog post). As I had written there, there are two ways in which the analytics team can end up not contributing to the firm – firstly, people think of them as geeks whom nobody understands, and ignore them. Secondly, and perhaps more dangerously, people think of the analytics guys as gods, and fail to challenge them sufficiently, thus putting too much faith in their models.

From this perspective, it is important for the analytics team to communicate well with the other managers – to explain the basic logic behind the models, so that the managers can understand the assumptions and limitations, and can use the models in the intended manner. What usually happens, though, is that after a few attempts when management doesn’t “get” the models, the analytics people resign themselves to using technical jargon and three letter acronyms to bulldoze their models past the managers.

The point of this post, however, is about black box models. Sometimes you have people (either analytics professionals or managers) using models without fully understanding them and their assumptions. This inevitably leads to disaster. A good example of this is the traders and quants who used David Li’s Gaussian copula, and ended up with horribly wrong models.

In order to prevent this, a good practice would be for the analytics people to be able to explain the model in an intuitive fashion (without using jargon) to the managers, so that they all understand the essence and nuances of the model in question. This, of course, means that you need to employ analytics people who are capable of effectively communicating their ideas, and employ managers who are able to at least understand some basic quant.

On finding the right signal

It is not necessary that every problem yields a “signal”. It is well possible that sometimes you try to solve a problem using data and you are simply unable to find any signal. This does not mean that you have failed in your quest – establishing the absence of a signal is itself valuable information, and needs to be appreciated.

Sometimes, however, clients and consumers of analytics fail to appreciate this. In their opinion, if you fail to find an answer to a particular problem, you as an analyst have failed in your quest. They think that with a better analyst or better analysis it is possible to get a superior signal.

This failure by consumers of analytics to appreciate that there need not always be a signal can lead to fudging. Let us say you have a data set with a very weak signal – say all your explanatory variables together explain about 1% of the variance in the dependent variable. In most cases (unless you are trading, in which case a 1% signal has some value), there is little value to be gleaned from this, and you are better off not applying a model. However, the fact that the client may not appreciate a “no” as an answer can lead you to propose this 1% explanatory model as truth.
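
To make the “1% of variance” point concrete, here is a minimal sketch in Python, on made-up data, of what such a weak signal looks like – the model fits, but it explains almost nothing:

```python
# Minimal sketch: a "model" whose explanatory variables explain ~1% of
# the variance in the dependent variable. Data here is made up.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 10_000
X = rng.normal(size=(n, 3))
# The true effect is tiny relative to the noise
y = 0.1 * X[:, 0] + rng.normal(scale=1.0, size=n)

model = LinearRegression().fit(X, y)
print(f"R^2 = {model.score(X, y):.3f}")   # roughly 0.01, i.e. ~1% explained
# Unless you are trading on it, there is little value in deploying this;
# the honest answer to the client is that there is no usable signal.
```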

What one needs to recognize is that a bad model can sometimes subtract value. One of my clients was using a model that had been put in place by an earlier consultant. This model had prescribed certain criteria they had to follow in recruitment, and I was asked to take a look at it. What I found was that the model showed absolutely no “signal” – based on my analysis, people with a high score as per that model were no more likely to do well than those that scored low!
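
The check behind that conclusion is not sophisticated. A minimal sketch of the kind of comparison involved (in Python, with hypothetical column and file names, not the client’s actual data) would look something like this:

```python
# Minimal sketch: did candidates who scored high on the recruitment model
# actually perform better than those who scored low? The file name and
# column names here are hypothetical.
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv("hires_with_scores.csv")   # hypothetical data set
high = df[df["model_score"] >= df["model_score"].median()]
low = df[df["model_score"] < df["model_score"].median()]

print("Mean performance, high scorers:", high["performance_rating"].mean())
print("Mean performance, low scorers: ", low["performance_rating"].mean())

t, p = ttest_ind(high["performance_rating"], low["performance_rating"],
                 equal_var=False)
print(f"p-value = {p:.2f}")
# Near-identical means and a large p-value say the score carries no signal:
# high scorers are no more likely to do well than low scorers.
```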

You might ask what the problem with such a model is. The problem is that by recommending a certain set of scores on a certain set of parameters, the model was filtering out a large number of candidates without any basis. Thus, using a poor model, the company was trying to recruit out of a much smaller pool, which left the hiring managers with less choice and led to suboptimal decisions. I remember closing that case with a recommendation to dismantle the model (since it wasn’t giving much of a signal anyway) and to instead simply empower the hiring manager!

Essentially, companies need to recognize two things. Firstly, not having a model is better than having a poor model, for a poor model can subtract value and lead to suboptimal decision-making. Secondly, not every problem has a quantitative solution. It is entirely possible that there is absolutely no signal in the data. If no signal exists, the analyst is not at fault for not finding one! In fact, she would be dishonest if she were to report a signal when none existed!

It is important that companies keep these two things in mind while hiring a consultant to solve a problem using data.

Missed opportunities in cross-selling

Talk to any analytics or “business intelligence” provider – be it a large commoditized outsourcing firm or a rather niche consultant – and one thing they all claim to advise their clients on is strategies for “cross-sell”. However, my personal experience suggests that implementation of cross-sell strategies among the retailers I encounter is extremely poor. I will illustrate this with two examples in this post.

Jet Airways and American Express have together come up with this “Jet Airways American Express Platinum Credit Card”. Like any other co-branded credit card, it offers you additional benefits on Jet Airways flights booked with the card (in terms of higher points) as well as some other benefits such as lounge access for economy travel. Given that I’m a consultant and travel frequently, this is something I think is good to have, and I have attempted to purchase it a few times – only to get discouraged by the purchase process each time and back out.

Now, I’m a customer of both Jet Airways and American Express. I hold an American Express Gold Card (perhaps one of the few people to have an individual AmEx card), and have a Jet Privilege account. Yet, neither Jet nor Amex seems remotely interested in selling to me. I remember once applying for this card through the Amex call centre. The person at the other end of the line wanted me to fill up the entire form once again – despite me already being a cardholder. This I would ascribe to messed-up incentive structures, where the salesperson at the other end gets higher benefits for acquiring a new customer than for upgrading an existing one. I’ve mentioned to the Amex call centre several times that I want this card, yet no one has called me back.

However, these are not the missed cross-sell opportunities I’m talking about in this post. Three times in the last three months (maybe more, but I cannot recollect) I’ve booked an air ticket to fly on Jet Airways from the Jet Airways website, having logged into my Jet Privilege account and paying with my American Express card. Each time I’ve waited hopefully for some system at either the Jet or the Amex end to make the connection and offer me this Platinum card, but so far there has been no response. It is perhaps the case that for some reason they do not want to upgrade existing customers to this card (in which case the entire discussion is moot), but not offering me a card here is a blatant missed opportunity – in cricketing terms you can think of it as an easy dropped catch.
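
The “connection” I was hoping for does not need any fancy modelling either – a trigger as simple as the sketch below (in Python, over a hypothetical bookings table whose field names are entirely my own invention) would have caught it:

```python
# Minimal sketch of a rule-based cross-sell trigger: a Jet Privilege member
# who repeatedly pays with an AmEx card but does not hold the co-branded
# card is an obvious candidate for the co-brand offer.
# The file name and field names are hypothetical.
import pandas as pd

bookings = pd.read_csv("jet_bookings.csv")   # hypothetical export

amex_bookings = bookings[
    (bookings["payment_network"] == "AMEX")
    & (bookings["loyalty_programme"] == "JetPrivilege")
    & (~bookings["has_cobrand_card"])
]

candidates = (
    amex_bookings.groupby("customer_id")
    .size()
    .loc[lambda counts: counts >= 2]   # two or more such bookings
)

print("Customers to offer the co-branded card:", list(candidates.index))
```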

The other case has to do with banking. I’m in the process of purchasing a house, and over the last few months have been transferring large amounts of money to the seller in order to make my down payments (which I’m meeting through my savings). Now, I’ve had my account with Citibank for over seven years and have never withdrawn such large amounts – except maybe to make some fixed deposits. One time, I got a call from the bank’s call centre, confirming that it was indeed I who had made the transfer. Why did the bank not think of finding out (in a discreet manner) why so much money had suddenly moved out of my account, whether I was looking to purchase something, and whether the bank could help? Of course, during a visit to the local Citibank branch recently I found I wouldn’t have got a loan from them anyway, since they don’t finance apartments built by no-name builders that are still under construction (which fits the bill of the property I’m purchasing). Nevertheless – the large sums transferred out of my account could have been for buying a property that the bank could have financed. Missed opportunity there?

My understanding of the situation is that in several “analytics” offerings there is a disconnect between the tech and the business sides. Somewhere along the chain of implementation there is one hand-off where one party knows only the business aspects and the other knows only technology, and thus the two are unable to converse, leading to suboptimal decisions. One kind of value I offer (hint! hint!!) is that I understand both tech and business, and I can ensure a much smoother hand-off between the technical and business aspects, thus leading to superior solution design.

Should you have an analytics team?

In an earlier post, I had talked about the importance of business people knowing numbers and numbers people knowing business, and had put in a small advertisement for my consulting services by mentioning that I know both business and numbers and work at their cusp. In this post, I take that further and analyze if it makes sense to have a dedicated analytics team.

Following the data boom, most companies have decided (rightly) that they need to do something to take advantage of all the data they have, and have created dedicated analytics teams. These teams, normally staffed with people from a quantitative or statistical background, with perhaps a few MBAs, are in charge of all the data the company has, along with doing some rudimentary analysis. The question is whether having such dedicated teams is effective, or whether it is better to have numbers-enabled people across the firm.

Having an analytics team makes sense from the point of view of economies of scale. People who are conversant with numbers are hard to come by, and when you find some, it makes sense to put them together and get them to work exclusively on numerical problems. That also ensures collaboration and knowledge sharing, which can have positive externalities.

Then, there is the data aspect. Anyone doing business analytics within a firm needs access to data from all over the firm, and if the firm doesn’t have a centralized data warehouse which houses all its data, one task of each analytics person would be to get together the data that they need for their analysis. Here again, the economies of scale of having an integrated analytics team work. The job of putting together data from multiple parts of the firm is not solved multiple times, and thus the analysts can spend more time on analyzing rather than collecting data.

So far so good. However, in a piece I wrote a while back, I had explained that investment banks’ policies of having exclusive quant teams have doomed them to long-term failure. My contention there (including an insider view) was that an exclusive quant team whose only job is to model, and which doesn’t have a view of the market, can quickly get insular, and can lead to groupthink. People are more likely to solve for problems as defined by their models rather than problems posed by the market. This, I had mentioned, can soon lead to a disconnect between the bank’s models and the markets, and ultimately to trading losses.

Extending that argument, it works the same way with non-banking firms as well. When you put together a group of numbers people, call them the analytics group, and only give them the job of building models rather than looking at actual business issues, they are likely to get similarly insular and opaque. While initially they might do well, they soon start getting disconnected from the actual business the firm is doing, and fall in love with their models. Like the quants at big investment banks, they too will start solving for their models rather than for the actual business, and that prevents the rest of the firm from getting the best out of them.

Then there is the jargon. If you say “I fitted a multinomial logistic regression and it gave me a p-value of 0.05 so this model is correct”, the business manager without much of a clue about numbers can be bulldozed into submission. By talking a language which most of the firm does not understand, you are obscuring yourself, which leads to one of two responses from the rest. Either they deem the analytics team to be incapable (since they fail to talk the language of business), in which case the purpose of having an analytics team is lost; or they assume the analytics team to be fundamentally superior (thanks to the obscurity of the language), in which case there is the risk of incorrect and possibly inappropriate models being adopted.

I can think of several solutions for this, but irrespective of what solution you ultimately adopt – whether you go completely centralized, completely distributed, or some hybrid of the two – the key step in getting the best out of your analytics is to have your senior and senior-middle management conversant with numbers. By that I don’t mean that they all go for a course in statistics. What I mean is that your middle and senior management should know how to solve problems using numbers. When they see data, they should have the ability to ask the right kind of questions. Irrespective of how the analytics team is placed, as long as you ask them the right kind of questions, you are likely to benefit from their work (assuming basic levels of competence, of course). This way, they can remain conversant with the analytics people, and a middle ground can be established so that insights from numbers can actually flow into the business.

So here is the plug for this post – shortly I’ll be launching short (1-day) workshops for middle and senior level managers in analytics. Keep watching this space :)