Christian Rudder and Corporate Ratings

One of the studdest book chapters I’ve read is from Christian Rudder’s Dataclysm. Rudder is a cofounder of OkCupid, now part of the match.com portfolio of matchmakers. In this book, he has used OkCupid’s own data to draw insights about human life and behaviour.

It is a typical non-fiction book, with a studmax first chapter, after which it gets progressively weaker. And it is the first chapter (which I’ve written about before) that I’m going to talk about here. There is a nice write-up and extract on Maria Popova’s website (which used to be called BrainPickings) here.

Quoting Maria Popova:

What Rudder and his team found was that not all averages are created equal in terms of actual romantic opportunities — greater variance means greater opportunity. Based on the data on heterosexual females, women who were rated average overall but arrived there via polarizing rankings — lots of 1’s, lots of 5’s — got exponentially more messages (“the precursor to outcomes like in-depth conversations, the exchange of contact information, and eventually in-person meetings”) than women whom most men rated a 3.

In one-hit markets like love (you only need to love and be loved by one person to be “successful” in this), high volatility is an asset. It is like option pricing if you think about it – higher volatility means greater chance of being in the money, and that is all you care about here. How deep out of the money you are just doesn’t matter.
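To make the option analogy concrete, here is a quick sketch (not from the book; plain Black-Scholes with made-up numbers) showing that a deep out-of-the-money call gains value as volatility rises:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(spot, strike, rate, vol, t):
    # Black-Scholes price of a European call option
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
    d2 = d1 - vol * sqrt(t)
    return spot * norm_cdf(d1) - strike * exp(-rate * t) * norm_cdf(d2)

# A deep out-of-the-money call: spot 100, strike 150, one year to expiry
for vol in (0.1, 0.3, 0.5):
    print(vol, round(bs_call(100, 150, 0.0, vol, 1.0), 2))
# Higher volatility means a higher option value, even though the option
# is well out of the money in every case. How far out of the money you
# start just doesn't matter as much as the volatility does.
```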

I was thinking about this in some random context this morning when I was also thinking of the corporate appraisal process. Now, the difference between dating and appraisals is that on OkCupid you might get several ratings on a 5-point scale, but in your office you only get one rating each year on a 5-point scale. However, if you are a manager, and especially if you are managing a large team, you will GIVE out lots of ratings each year.

And so I was wondering – what does the variance of the ratings you give out say about you as a manager? Assuming HR doesn’t impose any “grading on a curve” requirement, what does it say if you are a manager who gave out an average rating of 3 with a standard deviation of 0.5, versus a manager whose average was also 3, but with all employees receiving 1s and 5s?

From a corporate perspective, would you rather have a team full of 3s, or a team with a few 5s and a few 1s (who, it is likely, will leave)? Once again, if you think about it, it depends on your Vega (returns to volatility). In some sense, it depends on whether you are running a stud or a fighter team.

If you are running a fighter team, where there is no real “spectacular performance” but you need your people to grind it out, not make mistakes, pay attention to detail and do their jobs, you want a team full of 3s. The 5s in this team don’t contribute that much more than a 3. And 1s can seriously hurt your performance.

On the other hand, if you’re running a stud team, you will want high variance. Because by the sheer nature of work, in a stud team, the 5s will add significantly more value than the 1s might cause damage. When you are running a stud team, a team full of 3s doesn’t work – you are running far below potential in that case.
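A small illustrative calculation, with completely made-up payoff numbers, just to show the convexity argument: value a team of all 3s against a team of 1s and 5s under a concave “fighter” payoff and a convex “stud” payoff.

```python
# Made-up payoff curves: how much value an employee at each rating level adds.
# "Fighter" work is concave in rating (a 5 adds little over a 3, a 1 hurts a lot);
# "stud" work is convex (a 5 adds far more than a 1 subtracts).
fighter_payoff = {1: -10, 2: 6, 3: 10, 4: 11, 5: 12}
stud_payoff    = {1: -2, 2: 0, 3: 3, 4: 10, 5: 30}

team_of_threes = [3, 3, 3, 3, 3, 3]   # mean 3, zero variance
polarised_team = [1, 1, 1, 5, 5, 5]   # mean 3, maximum variance

def team_value(team, payoff):
    return sum(payoff[rating] for rating in team)

for name, payoff in (("fighter", fighter_payoff), ("stud", stud_payoff)):
    print(name,
          "| all 3s:", team_value(team_of_threes, payoff),
          "| 1s and 5s:", team_value(polarised_team, payoff))
# With the concave (fighter) payoff the uniform team comes out ahead;
# with the convex (stud) payoff the high-variance team wins comfortably.
```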

Assuming that your team has delivered, maybe the distribution of ratings across the team is a function of whether it does more stud or fighter work? Or am I force-fitting my pet theory a bit too much here?

Ratings revisited

Sometimes I get a bit narcissistic, and check how my book is doing. I log on to the seller portal to see how many copies have been sold. I go to the Amazon page and see what other books people who have bought my book are buying (on the US store it’s Ray Dalio’s Principles, as of now; on the UK and India stores, Sidin’s Bombay Fever is the beer to my book’s diapers).

And then I check if there are new reviews of my book. When friends write them, they notify me, so it’s easy to track. What I discover when I visit my Amazon page are the reviews written by people I don’t know. And so far, most of them have been good.

So today was one of those narcissistic days, and I was initially a bit disappointed to see a new four-star review. I started wondering what this person found wrong with my book. And then I read through the review and found it to be wholly positive.

A quick conversation with the wife followed, and she pointed out that this reviewer perhaps reserves five stars for the exceptional. And then my mind went back to this topic that I’d blogged about way back in 2015 – about rating systems.

The “4.8” score that Amazon gives as an average of all the ratings on my book so far is a rather crude measure – since one reviewer’s 4* might mean something significantly different from another reviewer’s 4*.

For example, my “default rating” for a book might be 5/5, with 4/5 reserved for books I don’t like and 3/5 for atrocious books. On the other hand, you might use the “full scale” and use 3/5 as your average rating, giving 4 for books you really like and very rarely giving a 5.

By simply taking an arithmetic average of ratings, it is possible to overstate the quality of a product that has for whatever reason been rated mostly by people with high default ratings (such a correlation is plausible). Similarly a low average rating for a product might mask the fact that it was rated by people who inherently give low ratings.
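A toy illustration of this, with hypothetical numbers: two products of identical quality, one rated by a lenient crowd and one by a harsh crowd, end up with very different raw averages.

```python
# Two hypothetical products of identical quality, rated by different crowds.
# The raters differ only in their "default" rating, not in their opinion of the product.
lenient_raters = [5, 5, 4, 5, 5]   # people whose default rating is a 5
harsh_raters   = [3, 3, 4, 3, 3]   # people whose default rating is a 3

avg = lambda xs: sum(xs) / len(xs)
print("product rated by the lenient crowd:", avg(lenient_raters))
print("product rated by the harsh crowd:  ", avg(harsh_raters))
# The raw averages (4.8 vs 3.2) suggest a quality gap that isn't really there.
```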

As I argue in the penultimate chapter of my book (or maybe the chapter before that – it’s been a while since I finished it), one way that platforms foster transactions is by increasing information flow between the buyer and the seller (this is one thing I’ve gotten good at – plugging my book’s name in random sentences), and one way to do this is by sharing reviews and ratings.

From this perspective, for a platform’s judgment on a product or seller (usually it’s the seller that gets judged, but on platforms such as Airbnb, information about buyers also matters) to be credible, it is important that ratings be aggregated in the right manner.

One way to do this is to use some kind of a Z-score (relative to other ratings that the rater has given) and then come up with a normalised rating. But then this needs to be readjusted for the quality of the other items that this rater has rated. So you can think of some kind of Singular Value Decomposition performed on the ratings matrix to find out the “true value” of a product (ok this is an achievement – using a linear algebra reference given how badly I suck at the topic).

I mean – it need not be THAT complicated, but the basic point is that it is important that platforms aggregate ratings in the right manner in order to convey accurate information about counterparties.
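If you want to see what the simple, non-SVD version might look like, here is a minimal sketch of per-rater z-score normalisation on a made-up ratings matrix:

```python
import numpy as np

# Hypothetical ratings matrix: rows are raters, columns are products, NaN = not rated.
ratings = np.array([
    [5.0, 4.0, np.nan, 5.0],   # a rater whose default is 5
    [3.0, np.nan, 2.0, 4.0],   # a rater who uses more of the scale
    [np.nan, 3.0, 3.0, 3.0],   # a rater who gives everything a 3
])

# Normalise each rating relative to that rater's own mean and spread (a per-rater z-score).
rater_mean = np.nanmean(ratings, axis=1, keepdims=True)
rater_std = np.nanstd(ratings, axis=1, keepdims=True)
rater_std[rater_std == 0] = 1.0   # avoid dividing by zero for one-note raters
z = (ratings - rater_mean) / rater_std

# A crude proxy for a product's "true value": the average of its normalised ratings.
product_score = np.nanmean(z, axis=0)
print(product_score)
# From here one could go further, as suggested above, and factorise the
# (mean-filled) matrix to also adjust for the quality of what each rater rates.
```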

On Uppi2’s top rating

So it appears that my former neighbour Upendra’s new magnum opus Uppi2 is currently the top rated movie on IMDB, with a rating of 9.7/10.0. The Times of India is so surprised that it has done an entire story about it, which I’ve screenshot here.

The story also mentions that another Kannada movie RangiTaranga (which I’ve reviewed here) is in third spot, with a rating of 9.4 out of 10. This might lead you to wonder why Kannada movies have suddenly turned out to be so good. The answer, however, lies in simple logic.

The first is that both are relatively new movies and hence their ratings suffer from “small sample bias”. Of course, the sample isn’t that small – Uppi2 has received 1900 votes, which is three times as many as its 1999 prequel Upendra. Yet, it being a new movie, only a subset of the small set of people who have watched it so far would have rated it.

The second is selection bias. The people who see a movie in its first week are usually the hardcore fans, and in this case it is hardcore fans of Upendra’s movies. And hardcore fans usually find it hard to have their belief shaken (a version of what I’ve written about online opinions for Mint here), and hence they all give the movie a high rating.

As time goes by, and people who are not such hardcore fans of Upendra start watching and rating the movie, the ratings are likely to rationalise. Finally, ratings are easy to rig, especially when samples are small. For example, an Upendra fan club might have decided to play up the movie online by voting en masse on IMDB, and pushing up its ratings. This might explain both why the movie already has 1900 ratings within four days, and why most of them are extremely positive.

The solution for this is for the rating system (IMDB in this case) to give more weight to “verified ratings” (by people who have rated more movies in the past, for instance), or to remove highly correlated ratings. Right now, the rating algorithm seems pretty naive.
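One simple way of doing the weighting (similar in spirit to the weighted-rating formula IMDB has long published for its Top 250, though the parameters below are entirely made up) is a “shrinkage” or Bayesian average, where a movie’s rating gets pulled towards the site-wide mean until it has accumulated enough votes to stand on its own:

```python
def shrunk_rating(movie_mean, num_votes, site_mean=6.9, prior_votes=5000):
    """Bayesian-average style shrinkage: a movie's rating is pulled towards the
    site-wide mean until it has enough votes. site_mean and prior_votes here
    are illustrative values, not IMDB's actual parameters."""
    return (num_votes * movie_mean + prior_votes * site_mean) / (num_votes + prior_votes)

# A new movie with 1,900 very enthusiastic votes averaging 9.7...
print(round(shrunk_rating(9.7, 1900), 2))      # shrinks a long way towards the mean
# ...versus an established movie with the same average over many more votes.
print(round(shrunk_rating(9.7, 500000), 2))    # barely moves
```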

Coming back to Uppi2, from what I’ve heard from people, the movie is supposed to be really good, though perhaps not 9.7 good. I plan to watch the movie in the next few days and will write a review once I do so.

Meanwhile, read this absolutely brilliant review (in Kannada) written by this guy called “Jogi”.

Rating systems need to be designed carefully

Different people use the same rating scale in different ways. Hence, nuance is required while aggregating ratings and taking decisions based on them.

During the recent Times Lit Fest in Bangalore, I was talking to some acquaintances regarding the recent Uber rape case (where a car driver hired through the Uber app in Delhi allegedly raped a woman). We were talking about what Uber can potentially do to prevent bad behaviour from drivers (which results in loss of reputation, and consequently business, for Uber), when one of them mentioned that the driver accused of rape had an earlier complaint against him within the Uber system, but because the complainant in that case had given him “three stars”, Uber had not pulled him up.

Now, Uber has a system of rating both drivers and passengers after each ride – you are prompted to give the rating as soon as the ride is done, and you are unable to proceed to your next booking unless you’ve rated the previous ride. What this ensures is that there is no selection bias in rating – typically you leave a rating only when the product/service has been exceptionally good or bad, leading to skewed ratings. Uber’s prompts imply that there is no opportunity for such bias and ratings are usually fair.

Except for one problem – different people have different norms for rating. For example, I believe that there is nothing “exceptional” that an Uber driver can do for me, and hence my default rating for all “satisfactory” rides is a 5, with lower scores being used progressively for different levels of infractions. For another user, the default might be 1, with 2 to 5 being used for various levels of good service. Yet another user might use only half the provided scale, with 3 being “pathetic”, for example. I once worked for a firm where annual employee ratings came out on a similar five-point scale. Over the years so much “rating inflation” had happened that by the time I worked there, anything marginally lower than 4 on 5 was enough to get you sacked.

What this means is that arithmetically averaging ratings across raters, and devising policies based on particular levels of ratings is clearly wrong. For example, when in the earlier case (as mentioned by my acquaintance) a user rated the offending driver a 3, Uber should not have looked at the rating in isolation, but in relation to other ratings given by that particular user (assuming she had used the service before).

It is a similar case with any other rating system – a rating looked at in isolation tells you nothing. What you need to do is to look at it in relation to other ratings by the user. It is also not enough to look at a rating in relation to just the “average” rating given by a user – variance also matters. Consider, for example, two users. Ramu uses 3 for average service, 4 for exceptional and 2 for pathetic. Shamu also uses 3 for average, but he instead uses the “full scale”, using 5 for exceptional service and 1 for pathetic. Now, if a particular product/service is rated 1 by both Ramu and Shamu, it means different things – in Shamu’s case it is “simply pathetic”, for that is both the lowest score he has given in the past and the lowest he can give. In Ramu’s case, on the other hand, a rating of 1 can only be described as “exceptionally pathetic”, for his variance is low and hence he almost never rates someone below 2!
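To put numbers on the Ramu and Shamu example (with hypothetical rating histories), a per-rater z-score makes the difference obvious:

```python
from statistics import mean, pstdev

# Hypothetical rating histories for the two users.
ramu  = [3, 3, 4, 2, 3, 3, 4, 3]   # low variance: rarely strays from 3
shamu = [3, 5, 1, 3, 5, 3, 1, 3]   # uses the full scale

def z_score(rating, history):
    # How unusual this rating is, relative to the rater's own history.
    return (rating - mean(history)) / pstdev(history)

print("A '1' from Ramu: ", round(z_score(1, ramu), 2))
print("A '1' from Shamu:", round(z_score(1, shamu), 2))
# The same raw rating of 1 is a far bigger outlier for Ramu (roughly -3.5
# standard deviations) than for Shamu (about -1.4): "exceptionally pathetic"
# versus "simply pathetic".
```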

Thus, while a rating system is a necessity in ensuring good service in a two-sided market, it needs to be designed and implemented in a careful manner. Lack of nuance in designing a rating system can result in undermining the system and rendering it effectively useless!