Medium stats

So Medium sends me this email:

Congratulations! You are among the top 10% of readers and writers on Medium this year. As a small thank you, we’ve put together some highlights from your 2016.

Now, I hardly use Medium. I’ve written maybe one post there (a long time ago) and read only a little (blogs I really like go into my RSS reader, and I read them on Feedly). So when Medium tells me that I, who consider myself a light user, am “in the top 10%”, they’re really giving away that the quality of usage on their site is pretty bad.
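To see how a “top 10%” badge can be consistent with very light usage, here is a toy simulation (all numbers are made up, not Medium’s): when the bulk of registered accounts are near-dormant, even someone who reads a handful of posts a year clears the top-decile cutoff.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up usage distribution: most registered accounts are dormant,
# and a small active core does nearly all the reading.
n = 1_000_000
posts_read = np.zeros(n)
active = rng.random(n) < 0.15               # assume only 15% ever read anything
posts_read[active] = rng.exponential(20, active.sum())

light_user = 10                             # read ten posts all year
cutoff = np.quantile(posts_read, 0.9)       # the "top 10%" threshold
print(f"top-10% cutoff: {cutoff:.1f} posts read")
print("light user in top 10%?", light_user > cutoff)   # True here
```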

Sometimes it’s bloody easy to see through flattery! People need to be more careful about what the stats they put out really convey!


Restaurants, deliveries and data

Delivery aggregators are moving customer data away from the retailer, who now has less knowledge about his customer. 

Ever since data collection and analysis became cheap (with cloud-based on-demand web servers and MapReduce), there have been attempts to collect as much data as possible and use it to do better business. I must admit to being part of this racket, too, as I try to convince potential clients to hire me so that I can tell them what to do with their data and how.

And one of the more popular areas where people have been trying to use data is in getting to “know their customer”. This is not a particularly new exercise – supermarkets, for example, have been offering loyalty cards so that they can correlate purchases across visits and get to know you better (as part of a consulting assignment, I once sat with my clients looking at a few supermarket bills. It was incredible how much we humans could infer about the customers by looking at those bills).

The more recent trend (now that it has become possible to analyse large amounts of data) is to capture “loyalties” across several stores or brands, so that affinities can be tracked across them and customers can be understood better. Given data privacy issues, this has typically been done by third-party agents, who then sell the insights back to the companies whose data they collect. An early example of this is Payback, which links activities on your ICICI Bank account with other products (telecom providers, retailers, etc.) to gain superior insights into what you are like.

Nowadays, with cookie farming on the web, this is more common, and you have sites that track your web cookies to figure out correlations between your activities, and thus infer your lifestyle, so that better advertisements can be targeted at you.

In the last two or three years, significant investments have been made by restaurants and retailers to install devices that help them get to know their customers better. Traditional retailers are installing point-of-sale devices (the provision of these devices is a highly fragmented market), and restaurants are introducing loyalty schemes (again a highly fragmented market). All of this is an attempt to get to know the customer better. Except that middlemen are ruining it.

I’ve written a fair bit on middleman apps such as Grofers and Swiggy. They are basically delivery apps, which pick up goods for you from a store and deliver them to your place. A useful service, though as I suggest in my posts linked above, probably overvalued. As the share of a restaurant or store’s business that goes to such intermediaries grows, though, there is another threat to the seller – loss of customer data.

When Grofers buys my groceries from my nearby store, it is unlikely to tell the store who it is buying for. The same goes when Swiggy buys my food from a restaurant. This means these sellers’ loyalty schemes will go for a toss. Not extending the loyalty programme to the delivery companies is, of course, a no-brainer. But the sellers are also missing out on the customer data they would otherwise have captured (had they sold directly to the customer).

A good thing for Grofers and Swiggy is that they’ve hit the market at a time when sellers are yet to fully realise the benefits of capturing customer data, so they may be able to capture such data cheaply, and perhaps sell it back to their seller clients. If you are a retailer selling to such aggregators and you value your customer data, though, make sure you get your pound of flesh from these guys.

On Uppi2’s top rating

So it appears that my former neighbour Upendra’s new magnum opus Uppi2 is currently the top rated movie on IMDB, with a rating of 9.7/10.0. The Times of India is so surprised that it has done an entire story about it, which I’ve screenshotted here: [screenshot of the Times of India story]

The story also mentions that another Kannada movie RangiTaranga (which I’ve reviewed here) is in third spot, with a rating of 9.4 out of 10. This might lead you to wonder why Kannada movies have suddenly turned out to be so good. The answer, however, lies in simple logic.

The first is that both are relatively new movies, and hence their ratings suffer from “small sample bias”. Of course, the sample isn’t that small – Uppi2 has received 1900 votes, three times as many as its 1999 prequel Upendra – but it being a new movie, only a subset of the small set of people who have watched it so far would have rated it.

The second is selection bias. The people who see a movie in its first week are usually the hardcore fans, and in this case it is hardcore fans of Upendra’s movies. And hardcore fans usually find it hard to have their belief shaken (a version of what I’ve written about online opinions for Mint here), and hence they all give the movie a high rating.

As time goes by, and people who are not such hardcore fans of Upendra start watching and reviewing the movie, the ratings are likely to rationalise. Finally, ratings are easy to rig, especially when samples are small. For example, an Upendra fan club might have decided to play up the movie online by voting en masse on IMDB, pushing up its rating. This might explain both why the movie has received 1900 ratings within four days, and why most of them are extremely positive.

The solution for this is for the rating system (IMDB in this case) to give more weight to “verified ratings” (from people who have rated many movies in the past, for instance), or to discount highly correlated ratings. Right now, the rating algorithm seems pretty naive.
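As a sketch of one such fix, here is the kind of Bayesian-average formula a rating system can use (IMDB’s Top 250 list reportedly uses something of this form; the prior parameters below are made-up numbers, not IMDB’s): a high average computed from few votes gets shrunk towards the site-wide mean.

```python
def weighted_rating(avg_rating, num_votes, prior_mean=6.9, prior_votes=5000):
    """Shrink a movie's average rating towards the site-wide mean.

    With few votes the movie's own average counts for little; only as
    votes accumulate does it dominate. prior_mean and prior_votes are
    illustrative numbers, not IMDB's actual parameters.
    """
    return (num_votes * avg_rating + prior_votes * prior_mean) / (num_votes + prior_votes)

# A 9.7 average from 1900 votes gets pulled down sharply...
print(round(weighted_rating(9.7, 1900), 2))     # 7.67
# ...while the same average from 190,000 votes barely moves.
print(round(weighted_rating(9.7, 190000), 2))   # 9.63
```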

Coming back to Uppi2, from what I’ve heard from people, the movie is supposed to be really good, though perhaps not 9.7 good. I plan to watch the movie in the next few days and will write a review once I do so.

Meanwhile, read this absolutely brilliant review (in Kannada) written by this guy called “Jogi”.

The Ramayana and the Mahabharata principles

An army of monkeys can’t win you a complex war like the Mahabharata. For that you need a clever charioteer.

A business development meeting didn’t go well. The potential client indicated his preference for a different kind of organisation to solve his problem. I was about to say “why would you go for an army of monkeys to solve this problem when you can.. ” but I couldn’t think of a clever end to the sentence. So I ended up not saying it.

Later on I was thinking of the line and of good ways to end it. The mind went back to Hindu mythology. The Ramayana war was won with an army of monkeys, of course. The Mahabharata war was won with the support of a clever and skilled consultant (Krishna didn’t actually fight the war, did he?). “Why would you go for an army of monkeys to solve this problem when you can hire a studmax charioteer?”, I finally phrased it. It still doesn’t have quite the right ring, but it’s a useful concept anyway.

Extending the analogy, the Ramayana war was different from the Mahabharata war. In the former, the enemy was a ten-headed demon who had abducted the hero’s wife. Despite what alternative retellings say, it was all mostly black and white. It was a simple war made complex by the special prowess of the enemy (ten heads, special weaponry, etc.). The army of monkeys proved decisive, and the war was won.

The Mahabharata war, on the other hand, was much more complex. Even mainstream retellings talk about the “shades of grey” in the war; both sides had their share of pluses and minuses. The enemy here was a bunch of cousins, who had snatched away the protagonists’ kingdom. Special weaponry existed on both sides. Sheer brute force, however, wouldn’t do. The Mahabharata war couldn’t be won with an army of monkeys. Its complexity meant that what it needed was skilled strategic guidance, and a bit of cunning – which is what Krishna provided when he was hired by Arjuna, ostensibly as a charioteer. Krishna’s entire army (highly trained and skilled, but mostly footsoldiers) fought on the opposite side, and couldn’t influence the outcome.

So when the problem at hand is simple, and the only complexity lies in its size or volume or in the prowess of the enemy, you will do well to hire an army of monkeys. They’ll work best for you there. But when you are faced with a complex situation, whose complexity goes well beyond the enemy’s prowess, you need a charioteer. So make the choice based on the kind of problem you are facing.


Datapukes and Dashboards

Avinash Kaushik has put out an excellent, if long, blog post on building dashboards. A key point he makes is about the difference between dashboards and what he calls “datapukes” (the name is quite self-explanatory and graphic – it basically refers to a report with a lot of data and little insight). He goes on to explain how dashboards need to be tailored to recipients at different levels in the organisation, and the common mistake of building a one-size-fits-all dashboard (which most likely ends up being a datapuke).

Kaushik explains that the higher up you go in an organisation’s hierarchy, the less access to data managers have, and the less time they have to look into and digest data before coming to a decision – they want the first level of interpretation to have been done for them, so that they can proceed to action. In this context, Kaushik explains that dashboards for top management should be “action-oriented”, in that they clearly show the way forward. Such dashboards need to be annotated, he says, with reasoning provided as to why the numbers look the way they do, and what the company needs to do to take care of it.

Going by Kaushik’s blog post, a dashboard is something that definitely requires human input – it requires an intelligent human to look at and analyse the data, figure out the reasons it looks the way it does, intelligently anticipate how the top management is likely to use it, and prepare the dashboard accordingly.

Now, notice how this requirement of an intelligent human in preparing each dashboard conflicts with the dashboard solutions that a lot of so-called analytics or BI (Business Intelligence) companies offer – basically automated reports with multiple tabs that the manager has to navigate in order to find useful information. In other words, datapukes!

Let us take a small digression – when you are at a business lunch, what kind of lunch do you prefer? Given three choices – a la carte, buffet and set menu, which one would you prefer? Assuming the kind of food across the three is broadly the same, there is reason to prefer a set menu over the other two options – at a business lunch you want to maximise the time you spend talking and doing business. Given that the lunch is incidental, it is best if you don’t waste any time or energy getting it (or ordering it)!

It is a similar case with dashboards for top management. While a datapuke might give a much broader view, and give the manager the opportunity to drill down, such luxuries are usually not necessary for a time-starved CXO – all he wants are the distilled insights, with a view towards what needs to be done. It is very unlikely that such a person will have the time or inclination to drill down – which can anyway be made possible via an attached datapuke.
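To make the datapuke-versus-dashboard contrast concrete, here is a toy sketch (the table, column names and numbers are all made up): the datapuke ships the whole table, while the dashboard distils one annotated, action-oriented line.

```python
import pandas as pd

# Hypothetical weekly numbers; a real datapuke would have dozens more columns.
df = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "revenue": [120, 95, 130, 60],
    "revenue_last_week": [110, 100, 125, 90],
})

# Datapuke: dump the whole table and let the CXO dig.
print(df.to_string(index=False))

# Dashboard: one distilled, annotated insight with a suggested action.
df["change_pct"] = 100 * (df["revenue"] / df["revenue_last_week"] - 1)
worst = df.loc[df["change_pct"].idxmin()]
print(f"\nRevenue in {worst['region']} fell {-worst['change_pct']:.0f}% week-on-week; "
      "investigate before finalising next week's promotions.")
```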

It will be interesting to see what happens to the BI and dashboarding industry once more companies figure out that what they want are insightful dashboards and not mere datapukes. Given the requirement of an intelligent human (essentially a business analyst) to make these “real” dashboards, will BI companies respond by assigning dedicated analysts to each of their clients? Or will we see a new layer of service providers (who might call themselves “management consultants”) who take in the datapukes and use their human intelligence to produce proper dashboards? Or will we find artificial intelligence building the dashboards?

It will be very interesting to watch this space!


Calibration and test sets

When you’re doing any statistical analysis, the standard thing to do is to divide your data into “calibration” and “test” data sets. You build the model on the “calibration” data set, and then test it on the “test” data set. The purpose of this slightly complicated procedure is to ensure that you don’t “overfit” your model.

Overfitting is what happens when, in your attempt to find a superior model, you build a model that is too tailored to your data, so that when you apply it to a different data set it can fail spectacularly. By setting aside some of your data as a “test” data set, you make sure that the model you built is not too closely calibrated to the data used to build it.

Now, there are several methods in which you can divide your data into “calibration” and “test” data sets. One method is to use a random number generator, and randomly divide the data into two parts – typically the calibration data set is about three times as big as the test data set (this is the rule I normally use, but there is no sanctity to this). The problem with this method, however, is that if you are building a model based on data collected at different points in time, any systematic change in behaviour over time cannot be captured by the model, and it loses predictive value. Let me explain.

Let us say that we are collecting some data over time. What data it is doesn’t matter, but essentially we are trying to use a set of variables to predict the value of another variable. Let us say that the relationship between the predictor variables and the predicted variable changes over time.

Now, if we were to build a model where we randomly divide data into calibration and test sets, the model we build will be something that takes into account the different regimes. The relationship between the predictor and predicted variables in the calibration data set is likely to be identical to that in the test data set – since both have been sampled uniformly across time. While that might look good, the problem is that this kind of model has little predictive value.

Another way of splitting your data into calibration and test sets is to split it over time. Rather than using a random number generator, we simply use time: the data collected in the first 3/4th of the time period (over which we’ve collected the data) forms the calibration set, and the last 1/4th forms the test set. A model tested on this kind of calibration and test data is a stronger model, for it has predictive value!

In real life, if you have to predict a variable in the future, all you have at your disposal is a model calibrated on past data. Thus, you need a model that works across time. And in order to make sure your model can work across time, you need to split your data into calibration and test sets across time – that way you can check that a model built with data from one time period can indeed work on data from a following time period!

Finally, how can you check if there is a “regime change” in the relationship between the predictor and predicted variables? We can use the difference between these two ways of splitting the data into calibration and test sets!

Firstly, split the data into calibration and test sets randomly. Find out how well the model explains the data in the test set. Next, split the data into calibration and test sets by time. Now find out how well the model explains the data in the test set. If there is not much difference in the performance of the model on the test set in these two cases, it means that there is no “regime change”. If there is a significant difference between the performance of the two models, it means that there is a definite regime change. Moreover, the extent of regime change can be evaluated based on the difference in goodness of fit in the two cases.
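Here is a minimal sketch of that check using scikit-learn, on synthetic data where the predictor–predicted relationship deliberately changes halfway through: the random split scores well, while the temporal split scores noticeably worse, flagging the regime change.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 1))
# Regime change: the true slope drops from 2 to 1 halfway through time.
slope = np.where(np.arange(n) < n // 2, 2.0, 1.0)
y = slope * X[:, 0] + rng.normal(scale=0.5, size=n)

# Split 1: random 75/25 -- calibration and test sets mix both regimes.
X_cal, X_test, y_cal, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
r2_random = LinearRegression().fit(X_cal, y_cal).score(X_test, y_test)

# Split 2: temporal 75/25 -- calibrate on the past, test on the future.
cut = int(0.75 * n)
r2_temporal = LinearRegression().fit(X[:cut], y[:cut]).score(X[cut:], y[cut:])

# A big gap between the two scores signals a regime change.
print(f"random split R^2:   {r2_random:.2f}")    # high
print(f"temporal split R^2: {r2_temporal:.2f}")  # noticeably lower
```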

Does facebook think my wife is my ex?

The “lookback” video feature that Facebook has launched on account of its tenth anniversary is nice. It flags all the statuses and photos you’ve uploaded that have been popular, and shows you how your life on Facebook has evolved through the years.

My “lookback” video is weird, though, in that it contains content exclusively from my “past life”. There is absolutely no mention of the wife, despite us having been married for over three years now! And it is not like we’ve hidden our marriage from Facebook – we have a large number of photos and statuses in the recent past in which both of us have been mentioned.

Now, the danger with an exercise such as the lookback is that it can dig up unwanted things from one’s past. Let’s say you were seeing someone, the two of you were all over Facebook together, and then you broke up. And when you then tried to clean up Facebook and get rid of the remnants of your past life, you missed some stuff. Facebook picks that up and puts it in your lookback video, making it rather unpleasant.

I’m sure the engineers at Facebook would have been aware of this problem, and hence would have come up with an algorithm to prevent such unpleasantness – some bright engineer there would have built a filter so that exes are left out.

Now, back in January 2010, the (now) wife and I announced that we were in a relationship. Our respective profiles each showed the other’s name, and we proudly displayed that we were in a relationship. Then in August of the same year, the status changed to “Engaged”, and in November to “Married”. Through this time we were mentioned on each other’s profiles as each other’s significant others.

Then, a year or two back – I’m not sure exactly when – the wife for some reason decided to remove the fact that she is married from Facebook. I don’t think she changed her relationship status; she just stopped making the fact that she’s married public. As a consequence, my relationship status automatically changed from “Married to Priyanka Bharadwaj” to just “Married”.

So, I think Facebook has a filter that says: if someone was once your significant other, and is no longer so (according to your Facebook relationship status), he or she is an ex. And an ex shall not appear in your lookback video – no matter how many status updates and photos you share after your “break-up”.

Since Priyanka decided to hide the fact that she’s married from Facebook, Facebook possibly thinks that we’ve broken up. The algorithm that created the lookback video would have ignored the fact that we still upload pictures in which both of us appear – it probably thinks we’ve broken up but are still friends!

So you have my lookback video, which is almost exclusively about my past life (interestingly, most people who appear in it are IIMB batchmates, though I joined Facebook two years after graduation), and which contains nothing of my present!

Algorithms can be weird!

Analytics and complexity

I recently learnt that a number of people think that the more variables you use in your model, the better the model is! What has surprised me is just how many people think so, and how unkindly my recommendations for simple models have been received.

The conversation usually goes like this

“so what variables have you considered for your analysis of ______ ?”
“A,B,C”
“Why don’t you consider D,E,F,… X,Y,Z also? These variables matter for these reasons. You should keep all of them and build a more complete model”
“Well I considered them but they were not significant so my model didn’t pick them up”
“No but I think your model is too simplistic if it uses only three variables”

This is a conversation I’ve had with so many people that I wonder what conceptions people have about analytics. Now I wonder if this is because of the difference between the way I communicate and the way other “analytics professionals” do.

When you do analytics, there are two ways to communicate – to simplify and to complicate (for lack of a better word). Based on my experience, a majority of analytics professionals and modelers prefer to complicate – they talk about the complicated statistical techniques they use for solving the problem (usually with fancy names) and bulldoze the counterparty into thinking they are indeed doing something hi-funda.

The other approach, followed by (in my opinion) a smaller number of people, is to simplify. You try to explain your model in simple terms that the counterparty will understand. So if your final model contains only three explanatory variables, you tell them that only three variables are used, and you show how each of these variables (and combinations thereof) contributes to the model. You draw analogies to models the counterparty can appreciate, and use those to explain.
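As an illustration of the “my model didn’t pick them up” line from the conversation above, here is a minimal sketch using statsmodels (the data and the variable names A–F are made up): variables whose coefficients are not significant get dropped, leaving a three-variable model that is easy to explain.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
# Made-up data: only A, B and C actually drive y; D, E and F are noise.
df = pd.DataFrame(rng.normal(size=(n, 6)), columns=list("ABCDEF"))
y = 2.0 * df["A"] - 1.5 * df["B"] + 0.8 * df["C"] + rng.normal(size=n)

# Fit on everything, then keep only the significant variables.
full = sm.OLS(y, sm.add_constant(df)).fit()
keep = [v for v in df.columns if full.pvalues[v] < 0.05]
small = sm.OLS(y, sm.add_constant(df[keep])).fit()

print("variables the model picked up:", keep)  # almost surely ['A', 'B', 'C']
print(small.params.round(2))                   # close to the true 2, -1.5, 0.8
```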

Now, just as analytics professionals can be divided into two kinds (as above), I think consumers of analytics can also be divided into two kinds: those who like to understand the model, and those who simply want to get to the insights. The former are better served by the complicating type of analytics professional, and the latter by the simplifying type. The other two combinations lead to disaster.

Like a good management consultant, I represent this problem using the following two-by-two:

[Figure: a two-by-two matrix of analytics professional type (simplifier vs complicator) against consumer type (wants to understand the model vs wants only the insights)]

As a principle, I like to explain models in a simplified fashion, so that the consumer can completely understand them and use them in a way he sees appropriate. The more pragmatic among you, however, can take a guess at what type the consumer is and tweak your communication accordingly.


Black Box Models

A few years ago, Felix Salmon wrote this article in Wired, called “The Formula That Killed Wall Street”. It was about the “Gaussian copula”, a formula for estimating the joint probability of a set of events happening, given the individual probabilities of each. It was a mathematical breakthrough.
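For the curious, here is a minimal sketch of the idea behind the formula, as I understand it (the probabilities and correlation below are made-up numbers): map each marginal probability to a standard-normal quantile, then read the joint probability off a correlated bivariate normal.

```python
from scipy.stats import multivariate_normal, norm

def joint_prob_gaussian_copula(p1, p2, rho):
    """P(both events occur) under a Gaussian copula with correlation rho."""
    # Map each marginal probability to a standard-normal quantile...
    z1, z2 = norm.ppf(p1), norm.ppf(p2)
    # ...and evaluate the bivariate normal CDF at those quantiles.
    biv = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return biv.cdf([z1, z2])

# Two instruments, each with a 5% chance of default (made-up numbers):
print(joint_prob_gaussian_copula(0.05, 0.05, 0.0))  # independent: 0.0025
print(joint_prob_gaussian_copula(0.05, 0.05, 0.8))  # correlated: roughly 0.025
```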

Unfortunately, it fell into the hands of quants and traders who didn’t fully understand it, and they used it to derive joint probabilities for large numbers of instruments put together. What they did not realize was that there was an error in the model (as there is in all models), and when they used the formula to tie together a large number of instruments, this error cascaded, resulting in an extremely inaccurate model and subsequent massive losses. (This paragraph is based on my reading of the situation; your mileage might vary.)

In a blog post earlier this week at Reuters, Salmon returned to this article. He said:

 And you can’t take technology further than its natural limits, either. It wasn’t really the Gaussian copula function which killed Wall Street, nor was it the quants who wielded it. Rather, it was the quants’ managers — the people whose limited understanding of copula functions and value-at-risk calculations allowed far too much risk to be pushed out into the tails. On Wall Street, just as in the rest of industry, a little bit of common sense can go a very long way.

I’m completely with him on this one. This blog post was in reference to Salmon’s latest article in Wired, which is about the four stages in which quants disrupt industries. You are encouraged to read both the Wired article and the blog post about it.

The essence is that it is easy to over-do analytics. Once you have a model that works in a few cases, you end up putting too much faith in the model; soon the model becomes gospel, and you build the rest of the organization around it (this is Stage Three that Salmon talks about). For example, a friend who is a management consultant once mentioned how bank lending practices are now increasingly formula driven. He mentioned reading a manager’s report that said “I know the applicant well, and am confident that he will repay the loan. However, our scoring system ranks him too low, hence I’m unable to offer the loan”.

The key issue, as Salmon mentions in his blog post, is that managers need to have at least a basic understanding of analytics (I had touched upon this issue in an earlier blog post). As I had written there, there are two ways in which the analytics team can end up not contributing to the firm – firstly, people think they are geeks whom nobody understands, and ignore them. Secondly, and perhaps more dangerously, people think of the analytics guys as gods, and fail to challenge them sufficiently, thus putting too much faith in their models.

From this perspective, it is important for the analytics team to communicate well with the other managers – to explain the basic logic behind the models, so that the managers can understand the assumptions and limitations, and can use the models in the intended manner. What usually happens, though, is that after a few attempts when management doesn’t “get” the models, the analytics people resign themselves to using technical jargon and three letter acronyms to bulldoze their models past the managers.

The point of this post, however, is about black box models. Sometimes you can have people (either analytics professionals or managers) using models without fully understanding them and their assumptions. This inevitably leads to disaster. A good example is the traders and quants who used David Li’s Gaussian copula and ended up with horribly wrong models.

In order to prevent this, a good practice would be for the analytics people to be able to explain the model in an intuitive fashion (without using jargon) to the managers, so that they all understand the essence and nuances of the model in question. This, of course, means that you need to employ analytics people who are capable of effectively communicating their ideas, and employ managers who are able to at least understand some basic quant.

Should you have an analytics team?

In an earlier post, I had talked about the importance of business people knowing numbers and numbers people knowing business, and had put in a small advertisement for my consulting services by mentioning that I know both business and numbers and work at their cusp. In this post, I take that further and analyze if it makes sense to have a dedicated analytics team.

Following the data boom, most companies have decided (rightly) that they need to do something to take advantage of all the data they have, and have created dedicated analytics teams. These teams, normally staffed with people from a quantitative or statistical background, with perhaps a few MBAs, are in charge of taking care of all the data the company has, along with doing some rudimentary analysis. The question is whether having such dedicated teams is effective, or whether it is better to have numbers-enabled people across the firm.

Having an analytics team makes sense from the point of view of economies of scale. People who are conversant with numbers are hard to come by, and when you find some, it makes sense to put them together and get them to work exclusively on numerical problems. That also enables collaboration and knowledge sharing, which can have positive externalities.

Then, there is the data aspect. Anyone doing business analytics within a firm needs access to data from all over the firm, and if the firm doesn’t have a centralized data warehouse housing all its data, one task of each analytics person would be to pull together the data they need for their analysis. Here again, the economies of scale of an integrated analytics team kick in. The job of putting together data from multiple parts of the firm is not solved multiple times, and the analysts can spend more time analyzing rather than collecting data.

So far so good. However, in a piece I wrote a while back, I had explained that investment banks’ policies of having exclusive quant teams have doomed them to long-term failure. My contention there (based partly on an insider view) was that an exclusive quant team, whose only job is to model and which doesn’t have a view of the market, can quickly get insular, and this can lead to groupthink. People are more likely to solve the problems defined by their models than the problems posed by the market. This, I had mentioned, can soon lead to a disconnect between the bank’s models and the markets, and ultimately to trading losses.

Extending that argument, it works the same way with non-banking firms as well. When you put together a group of numbers people, call them the analytics group, and only give them the job of building models rather than looking at actual business issues, they are likely to get similarly insular and opaque. While they might do well initially, they soon get disconnected from the actual business the firm is doing, and fall in love with their models. Like the quants at big investment banks, they too will start solving for their models rather than for the actual business, and that prevents the rest of the firm from getting the best out of them.

Then there is the jargon. If you say “I fitted a multinomial logistic regression and it gave me a p-value of 0.05, so this model is correct”, a business manager without much of a clue about numbers can be bulldozed into submission. By talking a language most of the firm doesn’t understand, you are obscuring yourself, which leads to one of two responses from the rest. Either they deem the analytics team to be incapable (since it fails to talk the language of business, in which case the purpose of the team’s existence is lost), or they assume the analytics team to be fundamentally superior (thanks to the obscurity of its language), in which case there is the risk of incorrect and possibly inappropriate models being adopted.

I can think of several solutions for this, but irrespective of what solution you ultimately adopt – whether you go completely centralized, completely distributed, or a hybrid – the key step in getting the best out of your analytics is to have your senior and senior-middle management team conversant with numbers. By that I don’t mean that they should all go for a course in statistics. What I mean is that your middle and senior management should know how to solve problems using numbers. When they see data, they should have the ability to ask the right kind of questions. Irrespective of how the analytics team is placed, as long as you ask it the right kind of questions, you are likely to benefit from its work (assuming basic levels of competence, of course). This way, management can remain conversant with the analytics people, and a middle ground can be established so that insights from numbers can actually flow into the business.

So here is the plug for this post – shortly I’ll be launching short (one-day) workshops in analytics for middle and senior level managers. Keep watching this space :)