George Mallory and Metrics

It is not really known whether George Mallory actually summited Everest in 1924 – he died on that climb, and his body was only found in 1999. It wasn’t his first attempt at scaling Everest, and at 37, some people thought he was too old to do so.

There is this popular story about Mallory that after one of his earlier attempts at scaling Everest, someone asked him why he wanted to climb the peak. “Because it’s there”, he replied.

George Mallory (extreme left) and companions

In the sense of adventure sport, that’s a noble intention to have. That you want to do something just because it is possible to do it is awesome, and can inspire others. However, one problem with taking quotes from something like adventure sport, and then translating it to business (it’s rather common to get sportspeople to give “inspirational lectures” to business people) is that the entire context gets lost, and the concept loses relevance.

Take Mallory’s “because it’s there” for example. And think about it in the context of corporate metrics. “Because it’s there” is possibly the worst reason to have a metric in place (or should we say “because it can be measured?”). In fact, if you think about it, a lot of metrics exist simply because it is possible to measure them. And usually, unless there is some strong context to it, the metric itself is meaningless.

For example, let’s say we can measure N features of a particular entity (take N = 4, and the features as length, breadth, height and weight, for example). There will be N! ways in which these metrics can be combined, and if you take all possible arithmetic operations, the number of metrics you can produce from these basic N metrics is insane. And you can keep taking differences and products and ratios ad infinitum, so with a small number of measurements, the number of metrics you can produce is infinite (both literally and figuratively). And most of them don’t make sense.
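To get a feel for how quickly this blows up, here is a rough Python sketch. It counts only ordered pairs of metrics combined with the four basic arithmetic operations, which is an assumption made purely for illustration (it also double counts equivalent expressions such as a + b and b + a). The function name `derived_metrics` is hypothetical.

```python
from itertools import permutations

features = ["length", "breadth", "height", "weight"]


def derived_metrics(names, ops=("+", "-", "*", "/")):
    """One round of combining every ordered pair of metrics with every operation."""
    return [f"({a} {op} {b})" for a, b in permutations(names, 2) for op in ops]


# Round one: combine the raw measurements with each other.
round_one = derived_metrics(features)

# Round two: combine everything we have so far (raw plus derived) once more.
round_two = derived_metrics(features + round_one)

print(len(round_one))  # 4 * 3 ordered pairs * 4 operations = 48
print(len(round_two))  # 52 * 51 ordered pairs * 4 operations = 10608
```

Two rounds of mindless combination already give over ten thousand candidate “metrics” from four measurements, and nothing in the arithmetic says which of them mean anything.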

That doesn’t normally dissuade our corporate “measurer”. That something can be measured, that “it’s there”, is sometimes enough reason to measure something. And soon enough, before you know it, Goodhart’s Law will have taken over, and that metric will have become a target for some poor manager somewhere (and of course, it soon ceases to be a good metric itself). And circular logic starts from there.

That something can be measured, even if it can be measured highly accurately, doesn’t make it a good metric.

So what do we do about it? If you are in a job that requires you to construct or design or make metrics, how can you avoid the “George Mallory trap”?

Long back when I used to take lectures on logical fallacies, I would have this bit on not mistaking correlation for causation. “Abandon your numbers and look for logic”, I would say. “See if the pattern you are looking at makes intuitive sense”.

I guess it is the same for metrics. It is all well to describe a metric using arithmetic. However, can you simply explain it in natural language, and can the listener easily understand what you are saying? And more importantly, does that make intuitive sense?

It might be fashionable nowadays to come up with complicated metrics (I do that all the time), in the hope that it will offer incremental benefit over something simpler, but more often than not the difficulty in understanding it makes the additional benefit moot. It is like machine learning, actually, where sometimes adding features can improve the apparent accuracy of the model, while you’re making it worse by overfitting.

So, remember that lessons from adventure sport don’t translate well to business. “Because it’s there” / “because it can be measured” is absolutely NO REASON to define a metric.

Financial ratio metrics

It’s funny how random things stick in your head a couple of decades later. I don’t even remember which class in IIMB this was. It surely wasn’t an accounting or a finance class. But it was one in which we learnt about some financial ratios.

I don’t even remember what exactly we had learnt that day (possibly return on invested capital?). I think it was three different financial metrics that can be read off a financial statement, and which then telescope very nicely together to give a fourth metric. I’ve forgotten the details, but I remember the basic concepts.

A decade ago, I used to lecture frequently on how NOT to do data analytics. I had this standard lecture that I called “smelling bullshit” that dealt with common statistical fallacies. Things like correlation-causation, or reasoning with small samples, or selection bias. Or stocks and flows.

One set of slides in that lecture was about not comparing stocks and flows. Most people don’t internalise it. It even seems like you cannot get a job as a journalist if you understand the distinction between stocks and flows. Every other week you see comparisons of someone’s net worth to some country’s GDP, for example. Journalists make a living out of this.

In any case, whenever I would come to these slides, there would always be someone in the audience with a training in finance who would ask “but what about financial ratios? Don’t we constantly divide stocks and flows there?”

And then I would go off into how we would divide a stock by a flow (typically) in finance, but we never compared a stock to a flow. For example, you can think of working capital as a ratio – you take the total receivables on the balance sheet and divide it by the sales in a given period from the income statement, to get “days of working capital”. Note that you are only dividing, not comparing the sales to the receivables. And then you take this ratio (which has dimension “days”) and then compare it across companies or across regions to do your financial analysis.

If you look at financial ratios, a lot of them have dimensions, though sometimes you don’t really notice it (I sometimes say “dimensional analysis is among the most powerful tools in data science”). Asset turnover, for example, is sales in a period divided by assets and has the dimension of inverse time. Inventory (total inventory on BS divided by sales in a period) has a dimension of time. Likewise working capital. Profit margins, however, are dimensionless.
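A minimal sketch of this kind of dimensional bookkeeping in Python follows. The `Quantity` class and all the figures are hypothetical, made up for illustration; the only idea taken from the text is that stocks carry a money dimension, flows carry money per unit time, and dividing one by the other leaves time or inverse time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    money: int = 0  # exponent of the money dimension
    time: int = 0   # exponent of the time dimension (days here)

    def __truediv__(self, other):
        # Dividing quantities subtracts the exponents of each dimension.
        return Quantity(self.value / other.value,
                        self.money - other.money,
                        self.time - other.time)

    def dimension(self):
        parts = []
        if self.money:
            parts.append(f"money^{self.money}")
        if self.time:
            parts.append(f"time^{self.time}")
        return " * ".join(parts) or "dimensionless"


# Stocks (balance sheet): money only. Figures are invented.
receivables = Quantity(60.0, money=1)
assets = Quantity(400.0, money=1)

# Flows (income statement): money per day. Figures are invented.
sales = Quantity(2.0, money=1, time=-1)
profit = Quantity(0.3, money=1, time=-1)

receivable_days = receivables / sales  # stock / flow
asset_turnover = sales / assets        # flow / stock
profit_margin = profit / sales         # flow / flow

print(receivable_days.dimension())  # time^1
print(asset_turnover.dimension())   # time^-1
print(profit_margin.dimension())    # dimensionless
```

Carrying the exponents around explicitly makes it obvious when two ratios cannot be compared with each other, which is the whole point of thinking in dimensions.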

In any case, the other day at work I was trying to come up with a ratio for something. I kept doing gymnastics with numbers on an Excel sheet, but without luck. And I had given up.

Nowadays I have started taking afternoon walks at office (whenever I go there), just after I eat lunch (I carry a box of lunch which I eat at my desk, and then go for a walk). And on today’s walk (or was it Tuesday’s?) I realised the shortcomings in my attempts to come up with a metric for whatever I was trying to measure.

I was basically trying too hard to come up with a dimensionless metric and kept coming up with some nonsense or the other. Somewhere during my walk, I thought of finance, and financial metrics. Light bulb lit up.

My mistake had been that I had been trying to come up with something dimensionless. The moment I realised that this metric needs to involve both stocks and flows, I had it. To be honest, I haven’t yet come up with the perfect metric (this is for those colleagues who are reading this and wondering what new metric I’ve come up with), but I’m on my way there.

Since both a stock and a flow need to be measured, the metric is going to be a ratio of both. And it is necessarily going to have dimensions (most likely either time or inverse time).

And if I think about it (again I won’t be able to give specific examples), a lot of metrics in life will follow this pattern – where you take a stock and a flow and divide one by the other. Not just in finance, not just in logistics, not just in data science, it is useful to think of metrics that have dimensions, and express them using those dimensions.

Some product manager (I have a lot of friends in that profession) once told me that a major job of being a product manager is to define metrics. Now I’ll say that dimensional analysis is the most fundamental tool for a product manager.

Legacy Metrics

Yesterday (or was it the day before? I’ve lost track of time with full time WFH now) the Times of India Bangalore edition had two headlines.

One was the Karnataka education minister BC Nagesh talking about deciding on school closures on a taluk (sub-district) wise basis. “We don’t want to take a decision for the whole state. However, in taluks where test positivity is more than 5%, we will shut schools”, he said.

That was on page one.

And then somewhere inside the newspaper, there was another article. The Indian Council for Medical Research has recommended that “only symptomatic patients should be tested for Covid-19”. However, for whatever reason, Karnataka had decided to not go by this recommendation, and instead decided to ramp up testing.

These two articles are correlated, though the paper didn’t say they were.

I should remind you of one tweet, that I elaborated about a few days back:

 

The reason why Karnataka has decided to ramp up testing despite advisory to the contrary is that changing policy at this point in time will mess with metrics. Yes, I stand by my tweet that test positivity ratio is a shit metric. However, with the government having accepted over the last two years that it is a good metric, it has become “conventional wisdom”. Everyone uses it because everyone else uses it. 

And so you have policies on school shutdowns and other restrictive measures being dictated by this metric – because everyone else uses the same metric, using this “cannot be wrong”. It’s like the old adage that “nobody got fired for hiring IBM”.

ICMR’s message to cut testing of asymptomatic individuals is a laudable one – given that an overwhelming number of people infected by the incumbent Omicron variant of covid-19 have no symptoms at all. The reason it has not been accepted is that it will mess with the well-accepted metric.

If you stop testing asymptomatic people, the total number of tests will drop sharply. The people who are ill will get themselves tested anyways, and so the numerator (number of positive reports) won’t drop. This means that the ratio will suddenly jump up.
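The jump is easy to see with some made-up numbers. The sketch below is only illustrative: the test counts and positive counts are hypothetical, and the function name `positivity` is mine. In this toy version the numerator does fall slightly, because the few asymptomatic positives disappear, but the denominator collapses far more.

```python
def positivity(groups):
    """groups: list of (tests, positives) pairs. Returns positives / tests."""
    tests = sum(t for t, _ in groups)
    positives = sum(p for _, p in groups)
    return positives / tests


# Hypothetical daily figures for one region: (tests, positives).
symptomatic = (20_000, 4_000)
asymptomatic = (80_000, 800)

before = positivity([symptomatic, asymptomatic])  # 4,800 of 100,000 tests
after = positivity([symptomatic])                 # 4,000 of 20,000 tests

print(f"{before:.1%} -> {after:.1%}")
```

Nothing about the disease has changed between the two lines; only the testing policy has. A 5% threshold calibrated on the first regime says nothing useful about the second.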

And that needs new measures – while 5% is some sort of a “critical number” now (like it is with p-values), the “critical number” will be something else. Moreover, if only symptomatic people are to be tested, the number of tests a day will vary even more – and so the positivity ratio may not be as stable as it is now.

All kinds of currently carefully curated metrics will get messed up. And that is a big problem for everyone who uses these metrics. And so there will be pushback.

Over a period of time, I expect the government and its departments to come up with alternate metrics (like how banks have now come up with an alternative to LIBOR), after which the policy to cut testing for asymptomatic people will get implemented. Until then, we should bow to the “legacy metric”.

And if you didn’t figure out already, legacy metrics are everywhere. You might be the cleverest data scientist going around and you might come up with what you think might be a totally stellar metric. However, irrespective of how stellar it is, the fact that people have to change their way of thinking and their processes to use it means that it won’t get much acceptance.

The strategy I’ve come to is to either change the metric slowly, in stages (change it little by little), or to publish the new metric along with the old one. Depending on how clever the new metric is, one of the metrics will die away.

Metrics

Over the weekend, I wrote this on twitter:

 

Surprisingly (at the time of writing this at least), I haven’t got that much abuse for this tweet, considering how “test positivity” has been held as the gold standard in terms of tracking the pandemic by governments and commentators.

The reason why I say this is a “shit metric” is simple – it doesn’t give that much information. Let’s think about it.

For a (ratio) metric to make sense, both the numerator and the denominator need to be clearly defined, and there needs to be clear information content in the ratio. In this particular case, both the numerator and the denominator are clear – latter is the number of people who got Covid tests taken, and the former is the number of these people who returned a positive test.

So far so good. Apart from being an objective measure, test positivity ratio is also a “ratio”, and thus normalised (unlike absolute number of positive tests).

So why do I say it doesn’t give much information? Because of the information content.

The problem with test positivity ratio is the composition of the denominator (now we’re getting into complicated territory). Essentially, there are many reasons why people get tested for Covid-19. The most obvious reason to get tested is that you are ill. Then, you might get tested when a family member is ill. You might get tested because your employer mandates random tests. You might get tested because you have to travel somewhere and the airline requires it. And so on and so forth.

Now, for each of these reasons for getting tested, we can define a sort of “prior probability of testing positive” (based on historical averages, etc). And the positivity ratio needs to be seen in relation to this prior probability. For example, in “peaceful times” (e.g. Bangalore between August and November 2021), a large proportion of the tests would be “random” – people travelling or employer-mandated. And this would necessarily mean a low test positivity.

The other extreme is when the disease is spreading rapidly – few people are travelling or going physically to work. Most of the people who get tested are getting tested because they are ill. And so the test positivity ratio will be rather high.

Basically – rather than the ratio telling you how bad the covid situation is in a region, it is influenced by how bad the covid situation is. You can think of it as some sort of a Schrödinger-ian measurement.

That wasn’t an offhand comment. Because government policy is an important input into test positivity ratio. For example, take “contact tracing”, where contacts of people who have tested positive are hunted down and also tested. The prior probability of a contact of a covid patient testing positive is far higher than the prior probability of a random person testing positive.

And so, as and when the government steps up contact tracing (as it does in the early days of each new wave), test positivity ratio goes up, as more “high prior probability” people get tested. Similarly, whether other states require a negative test to travel affects positivity ratio – the more the likelihood that you need a test to travel, the more likely that “low prior probability” people will take the test, and the lower the ratio will be. Or when governments decide to “randomly test” people (pulling them off the streets or whatever), the ratio will come down.
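Put differently, the positivity ratio is roughly a weighted average of these prior probabilities, with the weights set by who happens to get tested. Here is a small Python sketch of that idea. The priors and test counts are invented for illustration, and `expected_positivity` is a hypothetical helper, not any official calculation.

```python
# Hypothetical prior probability of a positive result, by reason for testing.
PRIORS = {"ill": 0.30, "contact": 0.15, "travel": 0.01, "employer": 0.01}


def expected_positivity(tests_by_reason):
    """Weighted average of the priors, weighted by number of tests per reason."""
    total = sum(tests_by_reason.values())
    positives = sum(n * PRIORS[reason] for reason, n in tests_by_reason.items())
    return positives / total


# Same priors, same total number of tests, different mix of who gets tested.
peaceful = {"ill": 500, "contact": 500, "travel": 6000, "employer": 3000}
wave_with_tracing = {"ill": 5000, "contact": 4000, "travel": 500, "employer": 500}

print(f"{expected_positivity(peaceful):.1%}")           # roughly 3%
print(f"{expected_positivity(wave_with_tracing):.1%}")  # roughly 21%
```

The priors are held fixed in both scenarios, so the entire difference comes from the mix, and the mix is something policy (contact tracing, travel rules, random testing) directly controls.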

In other words – the ratio can be easily gamed by governments, apart from just being influenced by government policy.

So what do we do now? How do we know whether the Covid-19 situation is serious enough to merit clamping down on people’s liberties? If test positivity ratio is a “shit metric” what can be a better one?

In this particular case (writing this on 3rd Jan 2022), absolute number of positive cases is as bad a metric as test positivity – over the last 3 months, the number of tests conducted in Bangalore has been rather steady. Moreover, the theory so far has been that Omicron is far less deadly than earlier versions of Covid-19, and the vaccination rate is rather high in Bangalore.

While defining metrics, sometimes it is useful to go back to first principles, and think about why we need the metric in the first place and what we are trying to optimise. In this particular case, we are trying to see when it makes sense to cut down economic activity to prevent the spread of the disease.

And why do we need lockdowns? To prevent hospitals from getting overwhelmed. You might remember the chaos of April-May 2021, when it was near impossible to get a hospital bed in Bangalore (even crematoriums had long queues). This is a situation we need to avoid – and the only one that merits lockdowns.

One simple measure we can use is to see how many hospital beds are actually full with covid patients, and if that might become a problem soon. Basically – if you can measure something “close to the problem”, measure it and use that as the metric. Rather than using proxies such as test positivity.

Because test positivity depends on too many factors, including government action. Because we are dealing with a new variant here, which is supposedly less severe. Because most of us have been vaccinated now, our response to getting the disease will be different. The change in situation means the old metrics don’t work.

It’s interesting that the Mumbai municipal corporation has started including bed availability in its daily reports.

Profit and politics

Earlier today I came across this article about data scientists on LinkedIn that I agreed with so much that I started wondering if it was simply a case of confirmation bias.

A few sentences (possibly taken out of context) from there that I agree with:

  • Many large companies have fallen into the trap that you need a PhD to do data science, you don’t.
  • There are some smart people who know a lot about a very narrow field, but data science is a very broad discipline. When these PhD’s are put in charge, they quickly find they are out of their league.
  • Often companies put a strong technical person in charge when they really need a strong business person in charge.
  • I always found the academic world more political than the corporate world and when your drive is profits and customer satisfaction, that academic mindset is more of a liability than an asset.

Back to the topic, which is the last of these sentences. This is something I’ve intended to write for 5-6 years now, since the time I started off as an independent management consultant.

During the early days I took on assignments from both for-profit and not-for-profit organisations, and soon it was very clear that I enjoyed working with for-profit organisations a lot more. It wasn’t about money – I was fairly careful in my negotiations to never underprice myself. It was more to do with processes, and interactions.

The thing in for-profit companies is that objectives are clear. While not everyone in the company has an incentive to increase the bottom-line, it is not hard to understand what they want based on what they do.

For example, in most cases a sales manager optimises for maximum sales. Financial controllers want to keep a check on costs. And so on. So as part of a consulting assignment, it’s rather easy to know who wants what, and how you should pitch your solution to different people in order to get buy-in.

With a not-for-profit it’s not that clear. While each person may have their own metrics and objectives, because the company is not for profit, these objectives and metrics need not be everything they’re optimising for.

Moreover, in the not for profit world, the lack of money or profit as an objective means you cannot differentiate yourself with efficiency or quantity. Take the example of an organisation which, for whatever reason, gets to advise a ministry on a particular subject, and does so without a fee or only for a nominal fee.

How can a competitor who possibly has a better solution to the same problem “displace” the original organisation? In the business world, this can be done by showing superior metrics and efficiency and offering to do the job at a lower cost and stuff like that. In the not-for-profit setup, you can’t differentiate on things like cost or efficiency, so the only thing you can do is to somehow provide your services in parallel and hope that the client gets it.

And then there is access. If you’re a not-for-profit consultant who has a juicy project, it is in your interest to become a gatekeeper and prevent other potential consultants from getting the same kind of access you have – for you never know if someone else who might get access through you might end up elbowing you out.

Shoes and metrics

The best metric to measure the age of a pair of shoes is the distance walked in them

My latest pair of “belt chappli” (sandals with a belt going around the heels) is only ten months old, but has started wearing. Walking long distances in the said sandals has become a pain. The top is nice, the sole is fantastic, but the inner sole has gotten FUBARed. Maybe it was a stone that got stuck under my feet which I didn’t notice. Maybe it was several such small stones. But with the inner sole “gone”, time is nigh to possibly retire the chappal.

But then a good pair of sandals is supposed to last much longer (and I did 2 longish foreign trips in this period where this chappal didn’t travel with me). Historically, good sandals have lasted two years or more. And it is not that this one is cheap. I paid close to Rs. 2000 for it, and it’s branded, too (Lee Cooper), and I had found it after a lot of difficulty (three months of searching). That it has lasted less than a year is not fair.

But then the question arises as to whether I have the right metrics in place. The number of months or years that a pair of shoes lasts is an intuitive metric of its quality, but it is not the right one. For, a pair of shoes doesn’t wear when it is not worn! Of course there might be mild wear and tear due to weather conditions, but for a pair of shoes made of good leather, that can be ignored.

So maybe the best metric for a pair of shoes is the amount of time it is worn? Then again, while a shoe might wear while it’s worn, it doesn’t wear too much when it’s at rest – I mean its shape changes to fit the wearer’s foot (over the medium term) and that might cause some wear and tear, but in the long run, there is unlikely to be much wear and tear at rest.

From that perspective, I hereby declare that the best metric to measure a shoe’s performance is the number of kilometres walked or run in it (the latter causes significantly more wear and tear, but let’s assume that walking shoes and running shoes are mutually exclusive, which they’re not). This is an excellent metric because it takes care of a number of features that correlate with the wear and tear, and is not hard to fathom.

Going by this metric, my current pair of “belt chappli” has put in considerable service. Over the last ten months, the frequency of going on “beats” in Jayanagar has gone up, and the distance covered in each beat, too. Having pretty much stopped driving, I walk more than I used to, and this is my default shoe for such perambulations.

The problem now is the search cost – good belt chapplis that fit my feet are hard to find. It’s a liquidity problem, I think (:P). Maybe I should just consider getting the inner sole replaced and get on with this one.

Volatility of Human Body Weight

Ever since I shed roughly 20 kilos over the course of the second half of last year, I’ve become extremely weight-conscious. Given how quickly I shed so much weight, I’m paranoid that I might gain back so much again as quickly. This means I monitor my weight as closely as I can, limit myself in terms of “sin foods” and check my weight as often as possible, typically whenever I manage to make it to the gym (about twice a week on average).

Having been used to analog scales lifelong (there’s one at home, but it is wrongly calibrated I think), the digital scales (with 7-segment display) that are there at a gym provide me with a bit of a problem. I think they are too precise – they show my weight up to 1 place of decimal (in kilograms), and thinking about it, I think that much detail is unwarranted.

The reason is that, given the normal cycles, I think the weight of the human body is highly volatile, and measuring a volatile commodity at a scale finer than the volatility (when all you are interested in is the long-term average) is fraught with danger and inaccuracy. For example, every time you drink two glasses of water, your weight shoots up by half a kilo. Every time you pee, your weight correspondingly comes down. Every time you eat, up the weight goes, and every time you defecate, down go the scales.

Given this, I find the digital weighing machine at my gym a bit of a pain, but then I’m trying to figure out what the normal volatility of the human body weight is, so that I can quickly catch on to any upward trend and make amends as soon as I can help it. Over the last couple of months, the machine has shown up various numbers between 73.8 and 75.5 and I have currently made a mental note that I’m not going to panic unless I go past 76.
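One rough way to put a number on this is to treat the gym readings as noisy samples around a stable average, take their standard deviation as the “volatility”, and only panic when a reading is well outside that band. The sketch below does this in Python. The readings are invented (chosen to fall in the 73.8 to 75.5 range mentioned above), and the two-standard-deviation rule is an assumption of this sketch, not something established earlier in the post. It also assumes there is no underlying trend in the data.

```python
from statistics import mean, stdev

# Hypothetical gym readings in kilograms, about twice a week.
readings = [74.6, 73.8, 75.1, 74.2, 75.5, 74.4, 74.9, 74.3]

average = mean(readings)
volatility = stdev(readings)  # sample standard deviation

# Assumed rule of thumb: worry only beyond two standard deviations.
panic_above = average + 2 * volatility


def is_alarming(reading):
    """True if a single reading is outside the normal noise band."""
    return reading > panic_above


print(f"average {average:.1f} kg, volatility {volatility:.2f} kg")
print(f"panic above {panic_above:.1f} kg")
```

With these made-up readings the threshold lands a little under 76, which suggests the mental panic limit is in the right neighbourhood. A single reading is still a weak signal; a run of readings drifting upwards would be a better alarm than any one number.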

I wonder if I’m making enough allowances for the volatility of my own body weight, and if I should reset my panic limits. I have other metrics to track my weight also – though my various trousers are all calibrated as “size 34” some have smaller waists than the others, and my algo every morning is to start wearing my pants starting from the smallest available, and go to work in the first one that fits, and when I know that I’m having trouble buttoning up my black chinos, that’s another alarm button.

Yeah sometimes I do think I’m too paranoid about my weight, but again it’s due to the speed at which I reduced that I’m anxious to make sure I don’t go back up at the same rate!

Update

Economist Ajay Shah sends me (and other members of a mailing list we belong to) this wonderful piece he has put together on weight management. Do read. But my question remains – how do you measure your body’s weight volatility?

Arranged Scissors 15: Stud and Fighter Beauty

Ok so here we come to the holy grail. The grand unification. Kunal Sawardekar can scream even more loudly now. Two concepts that I’ve much used and abused over the last year or so come together. In a post that will probably be the end of both these concepts in the blogging format. I think I want to write books. I want to write two books – one about each of these concepts. And after thinking about it, I don’t think a blook makes sense. Too many readers will find it stale. So, this post signals the end of these two concepts in blog format. They’ll meet you soon, at a bookstore near you.

So this post is basically about how the aunties (basically women of my mother’s generation) evaluate a girl’s beauty and about how it significantly differs from the way most others evaluate it. For most people, beauty is a subjective thing. It is, as the proverb goes, in the eyes of the beholder. You look at the thing of beauty (not necessarily a joy forever) as a complete package. And decide whether the package is on the whole beautiful. It is likely that different people have different metrics, but they are never explicit. Thus, different people find different people beautiful, and everyone has his/her share of beauty.

So I would like to call that as the “stud” way of evaluating beauty. It is instinctive. It is about insights hitting your head (about whether someone is beautiful or not). It is not a “process”. And it is “quick”. And “easy” – you don’t sweat much to decide whether someone is beautiful or not. It is the stud way of doing it. It is the way things are meant to be. Unfortunately, women of my mother’s generation (and maybe earlier generations) have decided to “fighterize” this aspect also.

So this is how my mother (just to take an example) goes about evaluating a girl. The girl is first split into components. Eyes, nose, hair, mouth, lips, cheeks, symmetry, etc. etc. Each of these components has its own weightage (different women use different weightages for evaluation. However, for a particular woman, the weightage set is the same irrespective of who she is evaluating). And each gets marked on a 5-point Likert scale (that’s what my mother uses; others might use scales of different lengths).

There are both subject-wise cutoffs and aggregate cutoff (this is based on the weighted average of scores for each component). So for a girl to qualify as a “CMP daughter-in-law”, she has to clear each of the subject cutoffs and also the total. Again – different women use different sets of cutoffs, but a particular woman uses only one set. And so forth.
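For the record, the “fighter” process described above is just a weighted score with two kinds of cutoffs. A minimal Python sketch follows. The component list, weights and cutoff values are all hypothetical, since the actual numbers any particular evaluator uses are not known.

```python
# Hypothetical weights for a subset of components; they sum to 1.
WEIGHTS = {"eyes": 0.3, "nose": 0.2, "hair": 0.2, "symmetry": 0.3}

COMPONENT_CUTOFF = 2    # assumed minimum mark on each 5-point component
AGGREGATE_CUTOFF = 3.5  # assumed minimum weighted average


def qualifies(scores):
    """Apply the per-component cutoffs first, then the aggregate cutoff."""
    if any(scores[c] < COMPONENT_CUTOFF for c in WEIGHTS):
        return False
    aggregate = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    return aggregate >= AGGREGATE_CUTOFF


# High marks elsewhere cannot rescue a failed component.
print(qualifies({"eyes": 5, "nose": 5, "hair": 5, "symmetry": 1}))  # False
# Clears every component and the weighted average (3.8).
print(qualifies({"eyes": 4, "nose": 4, "hair": 3, "symmetry": 4}))  # True
# Clears every component but not the weighted average (3.0).
print(qualifies({"eyes": 3, "nose": 3, "hair": 3, "symmetry": 3}))  # False
```

Written out like this, it is clear how much the outcome depends on arbitrary choices of weights and cutoffs, which is exactly what the “stud” way of judging never has to make explicit.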

I wonder when this system came into being, and why. I wonder if people stopped trusting their own judgment on “overall beauty” because of which they evolved this scale. I wonder if it was societal pressure that led women to look for a CMP daughter-in-law, for which purpose they adopted this scale. It’s not “natural” so I can’t give a “selfish gene” argument in support of it. But I still wonder. And my mother still uses scales such as this to evaluate my potential bladees. Such are life.