On Sunday evening, we were driving to a relative’s place in Mahalakshmi Layout when I almost missed a turn. And then I was about to miss another turn and my wife said “how bad are you with directions? You don’t even know where to turn!”.
“Well, this is your area”, I told her (she grew up in Rajajinagar). “I had very little clue of this part of town till I married you, so it’s no surprise I don’t know how to go to your cousin’s place”.
“But they moved into this house like six months ago, and every time we’ve gone there together. So if I know the route, why can’t you”, she retorted.
This gave me the trigger to go off on a rant about pre-trained models, and I’m going to inflict that on you now.
For a long time, I didn’t understand what the big deal was about pre-trained machine learning models. “If it’s trained on some other data, how will it even work with my data”, I wondered. And then recently I started using GPT-4 and other similar large language models. And I started reading blogposts on how, with very little fine-tuning, these models can do “gymnastics”.
Having grown up in North Bangalore, my wife has a “pretrained model” of that part of town in her head. This means she has sufficient domain knowledge, even if she doesn’t have any specific knowledge. Now, with a small amount of new specific information (the way to her cousin’s new house, for example), it is easy for her to fit the specific information into her generic knowledge and get a clear idea of how to get there.
(PS: I’m not at all suggesting that my wife’s intelligence is artificial here)
On the other hand, my domain knowledge of North Bangalore is rather weak, despite having lived there for two years. For the longest time, Malleswaram was a Chakravyuha – I would know how to go there, but not how to get back. Given this lack of domain knowledge, the little information on the way to my wife’s cousin’s new house is not sufficient for me to find my way there.
It is similar with machines. LLMs and other pre-trained models have sufficient “generic domain knowledge” of lots of things, thanks to the large amounts of data they’ve been trained on. As a consequence, if you then train them on fairly small samples of specific data, they are able to generalise around that specific data and learn from it.
More pertinently, in real life, depending upon our “generic domain knowledge” of different domains, the amount of information that you and I need in order to learn the same thing about a domain can be very, very different.
I’m writing this five minutes after making my wife’s “coffee decoction” using the Bialetti Moka pot. I don’t like chicory coffee early in the morning, and I’m trying to not have coffee soon after I wake up, so I haven’t made mine yet.
While I was filling the coffee powder into the moka pot, I was thinking of the concept of channelling. Basically, if you try to pack the moka pot too tight with coffee powder, then the steam (that goes through the coffee, thus extracting the caffeine) takes the easy way out – it tries to create a coffee-less channel to pass through, rather than do the hard work of extracting coffee as it passes through the layer of coffee.
I’m talking about steam here – water vapour, to be precise. It is as lifeless as it could get. It is the gaseous form of a colourless odourless shapeless liquid. Yet, it shows the seeming “intelligence” of taking the easy way out. Fundamentally this is just physics.
This is not an isolated case. Last week, at work, I was wondering why some algorithm was returning a “negative cost” (I was using local search, and after a few iterations, I found that the algorithm was rapidly taking the cost – which is supposed to be strictly positive – into deep negative territory). Upon careful investigation (thankfully it didn’t take too long), it transpired that there was a penalty cost which increased non-linearly with some parameter. And the algo had “figured” that if this parameter went really high, the penalty cost would go negative (basically I hadn’t done a good job of defining the penalty). And so it would take this channel.
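If I were to reconstruct the bug now (this is a sketch written for this post, with a made-up functional form – not the actual work code), it would look something like this in R:

```r
# The penalty was meant to grow (non-linearly) with some parameter, but a
# functional form like the made-up one below peaks and then turns negative --
# so any cost-minimising search can push the parameter sky-high and make the
# "strictly positive" cost arbitrarily negative.
penalty <- function(x) 10 * x - 0.1 * x^2   # fine for small x, negative beyond x = 100

x <- c(10, 50, 100, 200, 500)
data.frame(parameter = x, penalty = penalty(x), total_cost = 50 + penalty(x))
# the last couple of rows have deeply negative "costs" -- that is the channel the algo took
```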
Again, this algorithm has none of the supposedly scary “AI” or “ML” in it. It is a good old rule-based system, where I’ve defined all the parameters and only the hard work of finding the optimal solution is left to the algo. And yet, it “channelled”.
Basically, you don’t need to have a good reason for taking the easy way out. It is not even a human, or “animal”, thing to do – it is simply a physical fact. When there exists an easier path, you simply take it – whether you are an “AI” or an algorithm or just steam!
When I’m visiting someone’s house and they have an accessible bookshelf, one of the things I do is to go check out the books they have. There is no particular motivation, but it’s just become a habit. Sometimes it serves as conversation starters (or digressors). Sometimes it helps me understand them better. Most of the time it’s just entertaining.
So at a friend’s party last night, I found this book on Graph Theory. I just asked my hosts whose book it was, got the answer and put it back.
As many of you know, whenever we host a party, we use graph theory to prepare the guest list. My learning from last night’s party, though, is that you should not only use graph theory to decide WHO to invite, but also to adjust the times you tell people so that the party has the best outcome possible for most people.
With the full benefit of hindsight, the social network at last night’s party looked approximately like this. Rather, this is my interpretation of the social network based on my knowledge of people’s affiliation networks.
This is approximate, and I’ve collapsed each family to one dot. Basically it was one very large clique, and two or three other families (I told you this was approximate) who were largely only known to the hosts. We were one of the families that were not part of the large clique.
This was not the first such party I was attending, btw. I remember this other party from 2018 or so which was almost identical in terms of the social network – one very large clique, and then a handful of families only known to the hosts. In fact, as it happens, the large clique from the 2018 party and from yesterday’s party were from the same affiliation network, but that is only a coincidence.
Thinking about it, we ended up rather enjoying ourselves at last night’s party. I remember getting comfortable fairly quickly, and that mood carrying on through the evening. Conversations were mostly fun, and I found myself connecting adequately with most other guests. There was no need to get drunk. As we drove back peacefully in the night, my wife and daughter echoed my sentiments about the party – they had enjoyed themselves as well.
This was in marked contrast with the 2018 party with the largely similar social network structure (and dominant affiliation network). There we had found ourselves rather disconnected, unable to make conversation with anyone. Again, all three of us had felt similarly. So what was different yesterday compared to the 2018 party?
I think it had to do with the order of arrival. Yesterday, we were the second family to arrive at the party, and from a strict affiliation group perspective, the family that had preceded us at the party wasn’t part of the large clique affiliation network (though they knew most of the clique from beforehand). In that sense, we started the party on an equal footing – us, the hosts and this other family, with no subgroup dominating.
The conversation had already started flowing among the adults (the kids were in a separate room) when the next set of guests (some of them from the large clique) arrived, and the assimilation was seamless. Soon everyone else arrived as well.
The point I’m trying to make here is that because the non-large-clique guests had arrived first, they had had a chance to settle into the party before the clique came in, which meant the party never got too cliquey. That worked out brilliantly.
In contrast, in the 2018 party, we had ended up going rather late which meant that the clique was already in action, and a lot of the conversation had been clique-specific. This meant that we had struggled to fit in and never really settled, and just went through the motions and returned.
I’m reminded of another party WE had hosted back in 2012, where there was a large clique and a small clique. The small clique had arrived first, and by the theory in this post, should have assimilated well into the party. However, as the large clique came in, the small clique had sort of ended up withdrawing into itself, and I remember having had to make an effort to balance the conversation between all guests, and it not being particularly stress-free for me.
The difference there was that there were TWO cliques with me as cut-vertex. Yesterday, if you took out the hosts (cut-vertex), you would largely have one large clique and a few isolated nodes. And the isolated nodes coming in first meant they assimilated both with one another and with the party overall, and the party went well!
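For what it’s worth, this is easy to check with the igraph package in R – here is a toy version of last night’s party graph, with all names made up:

```r
# The hosts connect one large clique to a few families who mostly know only the hosts.
library(igraph)

party <- graph_from_literal(
  Hosts - A, Hosts - B, Hosts - C, Hosts - D,   # the hosts know everyone
  A - B, A - C, A - D, B - C, B - D, C - D,     # the large clique
  Hosts - Us, Hosts - F1, Hosts - F2            # the "isolated nodes"
)

articulation_points(party)   # returns just the Hosts: take them out and the graph falls apart
```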
And now that I’ve figured out this principle, I might break my head further at the next party I host – in terms of what times I tell different guests!
On Saturday morning, my daughter had made some nice art with sketch pen on an A4 paper. It was rather “geometric”, consisting of repeating patterns across the page. My wife took one look at it and said, “do you know that you can make such art with computers also? Your father has made some”.
“Really?”, piped the daughter. I had been intending for a while to start teaching her to code (she is six), and figured this was the perfect trigger, and said I will teach her.
A quick search revealed that there is an “ACS Logo” for Mac (Logo was the first “programming language” I had learnt, when I was nine). I quickly downloaded it on her computer (my wife’s old Macbook Air) and figured I remembered most of the commands.
And then I started typing, and showed her what they had shown me back in a “computer class” behind my house in 1992 – FD for “forward”, RT for “right turn”, HT for “hide turtle”. Etc. etc.
Soon she was engrossed in it. Thankfully she has learnt angles in her school, though it took her some trial and error to figure out how much to turn by for different shapes (later I was thinking this can also serve as a good “angles revision” for her during her ongoing summer holidays).
With my wife having reminded me that I could produce images through code, I realised that as my daughter was engrossed in her “coding”, I should do some “coding art” on my own. All she needed was some occasional input, and for me to sit right next to her.
Last Monday I had got a bit of a scare – at work, I needed to generate randomly distributed points in a regular hexagon. A lookup online told me that I could just get a larger number of randomly distributed points in a bounding rectangle, and then only pick points within the hexagon. And then take a random sample of those.
This had meant that I needed to write equations for whether a point lay inside a hexagon. And I realised I’d forgotten ALL my coordinate geometry. It took me over half an hour to get the equation for the sides of the hexagon right – I’m clearly rusty.
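For the record, a minimal sketch of that approach in R looks something like this (the hexagon here is a regular one, centred at the origin with circumradius 1 and a flat top – assumptions for this post, not the actual geometry from work):

```r
in_hexagon <- function(x, y, r = 1) {
  # inside the two horizontal edges, and inside the four slanted edges
  abs(y) <= sqrt(3) / 2 * r & abs(y) <= sqrt(3) * (r - abs(x))
}

random_points_in_hexagon <- function(n, r = 1) {
  pts <- data.frame()
  while (nrow(pts) < n) {
    # sample in the bounding rectangle, keep only what falls inside the hexagon
    x <- runif(2 * n, -r, r)
    y <- runif(2 * n, -sqrt(3) / 2 * r, sqrt(3) / 2 * r)
    keep <- in_hexagon(x, y, r)
    pts <- rbind(pts, data.frame(x = x[keep], y = y[keep]))
  }
  pts[seq_len(n), ]   # a random sample of exactly n points
}

plot(random_points_in_hexagon(2000), asp = 1, pch = ".")
```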
And on Saturday, as I sat down to make some “computer art”, I decided I’ll make some fractals. Why don’t I make some Sierpinski Triangles, I thought. I started breaking down what code I needed to write.
First, given an equilateral triangle, I had to return three similar equilateral triangles, each of half the side length of the original triangle.
Then, given the centroid of an equilateral triangle and the length of each side, I had to return the vertices.
Once these two functions had been written, I could just chain them (running the first one recursively) and then plot the result to get the Sierpinski triangle.
And then I had my second scare of the week – not only had I forgotten my coordinate geometry, I had forgotten my trigonometry as well. Again I messed up a few times, but the good thing about programming with a computer is that I could do trial and error. Soon I had it right, and started producing Sierpinski triangles.
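My reconstruction of those two functions and the recursion (written now for this post, not the original code) looks roughly like this – triangles are represented by centroid and side length, and are assumed to “point up”:

```r
triangle_vertices <- function(cx, cy, side) {
  r <- side / sqrt(3)                      # circumradius of an equilateral triangle
  angles <- c(90, 210, 330) * pi / 180
  data.frame(x = cx + r * cos(angles), y = cy + r * sin(angles))
}

split_triangle <- function(cx, cy, side) {
  # the three half-sized children: their centroids are the midpoints between
  # the parent's centroid and each of its vertices
  v <- triangle_vertices(cx, cy, side)
  data.frame(cx = (cx + v$x) / 2, cy = (cy + v$y) / 2, side = side / 2)
}

sierpinski <- function(depth, cx = 0, cy = 0, side = 1) {
  tris <- data.frame(cx = cx, cy = cy, side = side)
  for (i in seq_len(depth)) {
    tris <- do.call(rbind, Map(split_triangle, tris$cx, tris$cy, tris$side))
  }
  plot(NULL, xlim = c(-0.6, 0.6), ylim = c(-0.4, 0.7), asp = 1,
       axes = FALSE, xlab = "", ylab = "")
  for (i in seq_len(nrow(tris))) {
    v <- triangle_vertices(tris$cx[i], tris$cy[i], tris$side[i])
    polygon(v$x, v$y, col = "grey20", border = NA)
  }
}

sierpinski(5)
```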
Then, there was another problem – my code was really inefficient. If I went beyond depth 4 or 5, the figures would take inordinately long to render. Since I was coding in R, I set about vectorising all my code. In R you don’t write loops if you can help it – instead, you apply functions on entire vectors. This again took some time, and then I had the triangles ready. I proudly showed them off to my daughter.
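The general idea, with a trivial made-up example:

```r
# instead of growing a result inside a loop...
sides <- numeric(0)
for (depth in 1:10) sides <- c(sides, 0.5 ^ depth)

# ...operate on the whole vector at once:
sides <- 0.5 ^ (1:10)
```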
“Appa, why is it that as you increase the number it becomes greyer?”, she asked. I explained how with each step, you were taking away more of the filled areas from the triangles. Then I figured this wasn’t that good-looking – maybe I should colour it.
And so I wrote code to colour the triangles. Basically, I started recursively colouring them – the top third green, left third red and right third blue (starting with a red base). This is what I ended up producing:
And this is what my daughter produced at the same time, using Logo:
As many of the regular readers of this blog know, I largely use R for most of my data work. There have been a few occasions when I’ve tried to use Python, but have found that I’m far less efficient in that than I am with R, and so abandoned it, despite the relative ease of putting things into production.
Now in my company, like in most companies, people use both Python and R (the team that reports to me largely uses R, everyone else largely uses Python). And while till recently I used to claim that I’m multilingual in the sense that I can read Python code fairly competently, of late I’m not sure I am. I find it increasingly difficult to parse and read production-grade Python code.
And now, after some experiments with ChatGPT, and exploring other people’s code, I have an idea of why I’m finding it hard to read production-grade Python code. It has to do with “code density”.
Of late I’ve been experimenting with Spark (finally, in this job I do a lot of “big data” work – something I never had to in my consulting career prior to this). Related to this, I was reading someone’s PySpark code.
And then it hit me – the problem (rather, my problem) with Python is that it is far more verbose than R. The number of characters or lines of code required to do something in Python is far more than what you need in R (especially if you are using the tidyverse family of packages, which I do, including sparklyr for Spark).
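To make the point concrete, here is a trivial, entirely made-up example of the kind of dense tidyverse pipeline I mean – filter, group, aggregate and sort, all within one Reynolds pen length:

```r
library(dplyr)

# made-up toy data
orders <- data.frame(
  city   = c("BLR", "BLR", "BOM", "BOM", "DEL"),
  month  = c("Jan", "Jan", "Jan", "Feb", "Feb"),
  status = c("delivered", "delivered", "returned", "delivered", "delivered"),
  amount = c(100, 250, 80, 300, 120)
)

orders %>%
  filter(status == "delivered") %>%
  group_by(city, month) %>%
  summarise(revenue = sum(amount), n_orders = n(), .groups = "drop") %>%
  arrange(desc(revenue))
```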
Why does the density of code matter? It is to do with aesthetics and modularity and ease of understanding.
Yesterday I was writing some code that I plan to put into production. After a few hours of coding, I looked at the code and felt disgusted with myself – it was a way too long monolithic block of code. It might have been good when I was writing it, but I knew that if I were to revisit it in a week or two, I wouldn’t be able to understand what the hell was happening there.
I’ve never worked as a professional software engineer, but with the amount of coding I’ve done, I’ve worked out what is a “reasonable length for a code block”. It’s like that apocryphal story of Indian public examiners for high school exams who evaluate history answers based on how long they are – “if they were to place an ordinary Reynolds 045 pen vertically on the sheet, the answer should be longer than that for the student to get five marks”.
It’s the reverse here. Approximately speaking, if you were to place a Reynolds pen vertically on screen (at your normal font size), no function (or code block) can be longer than the pen.
This roughly approximates how much the eye can see on one normal MacBook screen (I use a massive external monitor at work, and a less massive, but equally wide, one at home). If you have to keep scrolling up and down to understand the full logic, there is a higher chance you might make mistakes, and it is harder for someone else to understand the code.
Till recently (as in earlier this week) I would crib like crazy that people coding in Python would make their code “too modular”. That I would have to keep switching back and forth between different functions (and classes!!) to make sense of some logic, and about how that would make the code hard to debug (I still think there is a limit to how modular you can make your ETL code).
Now, however (I’m writing this on a Saturday – I’m not working today), from the code density perspective, it all makes sense to me.
The advantage of R is that because the code is far denser, you can pack in a far greater amount of logic in a Reynolds pen length of code. So over the years I’ve gotten used to having this much logic being presented to me in one chunk (without having to scroll or switch functions).
The relatively lower density of Python, however, means that the same amount of logic that would be one function in R is now split across several different functions. It is not that the writer of the code is “making this too modular” or “writing functions just for the heck of it”. It is just that their “mental Reynolds pen” doesn’t allow them to pack in more lines in a chunk or function, and Python’s lower density means there is only so much logic that can go in there.
As part of my undergrad, I did a course on Software Engineering (and the one thing the professor told us then was that we should NOT take up software engineering as a career – “it’s a boring job”, he had said). In that, one of the things we learnt was that in conventional software services contexts, billing would happen as a (nonlinear) function of “kilo lines of code” (this was in early 2003).
Now, looking back, one thing I can say is that the rate per kilo line of R code ought to be much higher than the rate per kilo line of Python code.
Back in 2000, I entered the Computer Science undergrad program at IIT Madras thinking I was a fairly competent coder. In my high school, I had a pretty good reputation in terms of my programming skills and had built a whole bunch of games.
By the time half the course was done I had completely fallen out of love with programming, deciding a career in Computer Science was not for me. I even ignored Kama (current diro)’s advice and went on to do an MBA.
He had given me a 3 hour lecture about how I'm wasting my life when I happened to tell him that I was preparing for CAT. https://t.co/1rWG0BnymW
What had happened? Basically it was a sudden increase in the steepness of the learning curve. Or that I’m a massive sucker for user experience, which the Computer Science program didn’t care for.
Back in school, my IDE of choice (rather the only one available) was TurboC, a DOS-based program. You would write your code, and then hit Ctrl+F9 to run the program. And it would just run. I didn’t have to deal with any technical issues. Looking back, we had built some fairly complex programs just using TurboC.
And then I went to IIT and found that there was no TurboC, no DOS. Most computers there had an ancient version of Unix (or worse, Solaris). These didn’t come with nice IDEs such as TurboC. Instead, you had to use vi (some of the computers were so old they didn’t even have vim) to write the code, and then compile it from outside.
Difficulties in coming to terms with vi meant that my speed of typing dropped. I couldn’t “code at the speed of thought” any more. This was the first give up moment.
Then, I discovered that C++ had now got this new “Standard Template Library” (STL), with vectors and stuff. This was very alien to the way I had learnt C++ in school. Also I found that some of my classmates were very proficient with it, and I just couldn’t keep up. The effort seemed too much (and the general workload of the program was so high that I couldn’t get much time for “learning by myself”), so I gave up once again.
Next, I figured that a lot of my professors were suckers for graphic UIs (though they denied us good UX by denying us good computers). This, circa 2001-2, meant programming in Java and writing applets. It was a massive degree of complexity (and “boringness”) compared to the crisp C/C++ code I was used to writing. I gave up yet again.
I wasn’t done with giving up yet. Beyond all of this, there was “systems programming”. You had to write some network layers and stuff. You had to go deep into the workings of the computer system to get your code to run. This came rather intuitively to most of my engineering-minded classmates. It didn’t to me (programming in C was the “deepest” I could grok). And I gave up even more.
I did my B.Tech. project in “theoretical computer science”, managed to graduate and went on to do an MBA. Just before my MBA, I was helping my father with some work, and he figured I sucked at Excel. “What is the use of completing a B.Tech. in computer science if you can’t even do simple Excel stuff?”, he thundered.
In IIMB, all of us bought computers with pirated Windows and Office. I started using Excel. It was an absolute joy. It was a decade before I started using Apple products, but the UX of Windows was such a massive upgrade compared to what I’d seen in my more technical life.
In my first job (where I didn’t last long) I learnt the absolute joy of Visual Basic macros for Excel. This was another level of unlock. I did some insane gymnastics in that. I pissed off a lot of people in my second job by replicating what they thought was a complex model on an Excel sheet. In my third job, I replaced a guy on my team with an Excel macro. My programming mojo was back.
Goldman Sachs’s SLANG was even better. By the time I left there, I had learnt R as well. And then I became a “data scientist”. People asked me to use Python. I struggled with it. After the user experience of R, this was too complex. This brought back bad memories of all the systems programming and dealing with bad UX I had encountered in my undergrad. This time I was in control (I was a freelancer) so I didn’t need to give up – I was able to get all my work done in R.
The second giving up
I’ve happily used R for most of my data work in the last decade. Recently at work I started using Databricks (still write my code in R there, using sparklyr), and I’m quite liking that as well. However, in the last 3-4 months there has been a lot of developments in “AI”, which I’ve wanted to explore.
The unfortunate thing is that most of this is available only in Python. And the bad UX problem is back again.
Initially I got excited, and managed to install Stable Diffusion on my personal Mac. I started writing some OpenAI code as well (largely using R). I started tracking developments in artificial intelligence, and trying some of them out.
And now, in the last 2-3 weeks, I’ve been struggling with “virtual environments”. Each newfangled open-source AI that is released comes with its own codebase and package requirements. They are all mutually incompatible. You install one package, and you break another package.
The “solution” to this, from what I could gather, is to use virtual environments – basically a sandbox for each of these things that I’ve downloaded. That, I find, is grossly inadequate. One of the points of using open source software is to experiment with connecting up two or more of them. And if each needs to be in its own sandbox, how is one supposed to do this? And how are all other data scientists and software engineers okay with this?
This whole virtual environment mess means that I’m giving up on programming once again. I won’t fully give up – I’ll continue to use R for most of my data work (including my job), but I’m on the verge of giving up when it comes to these “complex AI” things.
It’s the UX thing all over again. I simply can’t handle bad UX. Maybe it’s my ADHD. But when something is hard to use, I simply don’t want to use it.
And so I’m giving up once again. Tagore was right.
Back when I was a student, there was this (rather large) species of students whom we used to call “muggoos”. They were called that because they had a habit of “mugging up the answers” – basically they would learn verbatim the stuff in the textbooks and other reading material, and then just spit it out during the exams.
They were incredibly hardworking, of course – since the volume of stuff to mug was immense – and they would make up for their general lack of understanding of the concepts with their massive memories and rote learning.
On average, they did rather well – with all that mugging, the downside was floored. However, they would stumble badly in case of any “open book exams” (where we would be allowed to carry textbooks into the exams) – since the value of mugging there was severely limited. I remember having an argument once with some topper-type muggoos (with generally much better grades than me) on whether to keep exams in a particular course open book or closed book. They all wanted closed book, of course.
This morning, I happened to remember this species while chatting with a friend. He was sending me some screenshots from ChatGPT and was marvelling at something which it supposedly made up (I remembered it as a popular meme from 4-5 years back). I immediately responded that ChatGPT was simply “overfitting” in this case.
Since this was a rather popular online meme, and a lot of tweets would have been part of ChatGPT’s training data, coming up with this “meme-y joke” was basically the algorithm remembering this exact pattern that occurred multiple times in the training set. There was no need to intuit or interpolate or hallucinate – the number of occurrences in the training set meant this was an “obvious joke”.
In that sense, muggoos are like badly trained pieces of artificial intelligence (well, I might argue that their intelligence IS artificial) – they haven’t learnt the concepts, so they are unable to be creative or hallucinate. However, they have been “trained” very very well on the stuff that is there in the textbooks (and other reading material) – and the moment they see part of that it’s easy for them to “complete the sentences”. So when questions in the exams come straight out of the reading materials (as they do in a LOT of Indian universities and school boards) they find it easy to answer.
However, when tested on “concepts”, they now need to intuit – and infer based on their understanding. In that sense, they are like badly trained machine learning models.
One of the biggest pitfalls in machine learning is “overfitting” – where you build a model that is so optimised to the training data that it learns quirks of the data that you don’t want it to learn. It performs superbly on the training dataset. Now, when faced with an unknown (“out of syllabus”) test set, it underperforms like crazy. In machine learning, we use techniques such as cross validation to make sure algorithms don’t overfit.
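For those unfamiliar, a minimal sketch of k-fold cross validation (with made-up toy data) looks like this – fit on k-1 folds, test on the held-out fold, and average, so a model that has merely memorised its training data gets found out on the “out of syllabus” folds:

```r
set.seed(42)
df <- data.frame(x = runif(200))
df$y <- 3 * df$x + rnorm(200, sd = 0.3)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))

cv_rmse <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit <- lm(y ~ poly(x, 15), data = train)   # a deliberately over-flexible model
  sqrt(mean((predict(fit, newdata = test) - test$y)^2))
})

mean(cv_rmse)   # held-out error; compare with the flattering in-sample error
```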
That, however, is not how the conventional Indian education system trains you – throughout most of your education, you find that the “test set” is a subset of the “training set” (questions in examinations come straight out of the textbook). Consequently, people with the ability to mug find that it is a winning strategy to just “overfit” and learn the textbooks verbatim – the likelihood of being caught out by unseen test data is minimal.
And then IF they get out into the real world, they find that a lot of the “test data” is unknown, and having not learnt to truly learn from the data, they struggle.
Last evening we hosted a party at home. Like all parties we host, we used Graph Theory to plan this one. This time, however, we used graph theory in a very different way to how we normally use it – our intent was to avoid large cliques. And, looking back, I think it worked.
First, some back story. For some 3-4 months now we’ve been planning to have a party at home. There has been no real occasion accompanying it – we’ve just wanted to have a party for the heck of it, and to meet a few people.
The moment we started planning, my wife declared “you are the relatively more extrovert among the two of us, so organising this is your responsibility”. I duly put NED. She even wrote a newsletter about it.
The gamechanger was this podcast episode I listened to last month.
The episode, like a lot of podcast episodes, is related to this book that the guest has written. Something went off in my head as I listened to this episode on my way to work one day.
The biggest “bingo” moment was that this was going to be a strictly 2-hour party (well, we did 2.5 hours last night). In other words, “limited liability”!!
One of my biggest issues about having parties at my house is that sometimes guests tend to linger on, and there is no “defined end time”. For someone with limited social skills, this can be far more important than you think.
The next bingo was that this would be a “cocktail” party (meaning, no main course food). Again that massively brought down the cost of hosting – no planning menus, no messy food that would make the floor dirty, no hassles of cleaning up, and (most importantly) you could stick to your 2 / 2.5 hour limit without any “blockers”.
Listen to the whole episode. There are other tips and tricks, some of which I had internalised ahead of yesterday’s party. And then came the matter of the guest list.
I’ve always used graph theory (coincidentally my favourite subject from my undergrad) while planning parties. Typical use cases have been to ensure that the graph is connected (everyone knows at least one other person, and no group of guests is cut off from the rest) and that there are no “cut vertices” (you don’t want the graph to get disconnected if one person doesn’t turn up).
This time we used it in another way – we wanted the graph to be connected but not too connected! The idea was that if there are small groups of guests who know each other too well, then they will spend the entirety of the party hanging out with each other, and not add value to the rest of the group.
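In igraph terms (this is a made-up candidate guest list, not our actual one), the checks we informally run look something like this:

```r
library(igraph)

guests <- graph_from_literal(
  Us - A, Us - B, A - B,        # people we know, who also know each other
  B - C, C - D, D - E, C - E,   # a cluster from one affiliation network
  Us - F                        # someone only we know
)

is_connected(guests)          # nobody is left stranded
articulation_points(guests)   # whose absence would disconnect the party?
clique_num(guests)            # size of the largest clique -- we want this small
```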
Related to this was the fact that we had pre-decided that this party is not going to be a one-off, and we will host regularly. This made it easier to leave out people – we could always invite them the next time. Again, it is important that the party was “occasion-less” – if it is a birthday party or graduation party or wedding party or some such, people might feel offended that you left them out. Here, because we know we are going to do this regularly, we know “everyone’s number will come sometime”.
I remember the day we made the guest list. “If we invite X and Y, we cannot invite Z since she knows both X and Y too well”. “OK let’s leave out Z then”. “Take this guy’s name off the list, else there will be too many people from this hostel”. “I’ve met these two together several times, so we can call exactly one of them”. And so on.
With the benefit of hindsight, it went well. Everyone who said they will turn up turned up. There were fourteen adults (including us), which meant that there were at least three groups of conversation at any point in time – the “anti two pizza rule” I’ve written about. So a lot of people spoke to a lot of other people, and it was easy to move across groups.
I had promised to serve wine and kODbaLe, and kept it – kODbaLe is a fantastic party food in that it is large enough that you don’t eat too many in the course of an evening, and it doesn’t mess up your fingers. So no need of plates, and very little use of tissues. The wine was served in paper cups.
I wasn’t very good at keeping up timelines – maybe I drank too much wine. The party was supposed to end at 7:30, but it was 7:45 when I banged a spoon on a plate to get everyone’s attention and inform them that the party was over. In another ten minutes, everyone had left.
For a long time I have had this shibboleth for whether someone is a “statistics person” or a “machine learning person”. It is based on what they call regressions where the dependent variable is binary. Statisticians simply call it “logit” (there is also a “probit”), while machine learning people call it “logistic regression”.
Now, in terms of implementation as well, there is one big difference between the way “logit” is modelled versus “logistic regression”. For a logit model (if you are using Python, you need to use the “statsmodels” package for this, not scikit-learn), the number of observations needs to far exceed the number of independent variables.
Else, a matrix that needs to be inverted as part of the solution will turn out to be singular, and there will be no solution. I guess I betrayed my greater background in statistics than in Machine Learning when, in 2018, I wrote this blogpost on machine learning being a “process to tie down coefficients in maths models“.
“Logistic regression” (as opposed to “logit”), on the other hand, puts no such constraint on the regression matrix being invertible. Instead of actually inverting any matrix, machine learning approaches simply learn the coefficients using gradient descent (basically the opposite of hill climbing), so mathematical inconveniences such as matrices that cannot be inverted are moot there.
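To illustrate (a toy sketch with made-up data, not anyone’s production code), fitting a logistic regression by gradient descent in R needs no matrix inversion at all:

```r
set.seed(1)
n <- 500
X <- cbind(1, matrix(rnorm(n * 2), ncol = 2))   # intercept + two features
true_beta <- c(-0.5, 2, -1)
y <- rbinom(n, 1, plogis(X %*% true_beta))

beta <- rep(0, ncol(X))
lr <- 0.1
for (i in 1:5000) {
  p <- plogis(X %*% beta)              # predicted probabilities
  grad <- t(X) %*% (p - y) / n         # gradient of the average log-loss
  beta <- beta - lr * grad             # descend (the opposite of hill climbing)
}
round(beta, 2)   # close to true_beta, and no matrix had to be inverted
```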
And so you have logistic regression models with thousands of variables, often calibrated with fewer data points than variables. To be honest, I can’t understand this fully – without sufficient information (data points) to calibrate the coefficients, there will always be a sense of randomness in the output. The model has too many degrees of freedom, and so there is additional information the model is supplying (apart from what was supplied in the training data!).
Of late I have been playing a fair bit with generative AI (primarily ChatGPT and Stable Diffusion). The other day, my daughter and I were alone in my in-laws’ house, and I told her “look I’ve brought my personal laptop along, if you want we can play with it”. And she had demanded that she “play with stable diffusion”. This is the image she got for “tiger chasing deer”.
I have written earlier here about how the likes of ChatGPT and Stable Diffusion in a way redefine “information content“.
What the likes of Dall-E and Stable Diffusion have clearly shown is that a picture is worth far less than a thousand words
And if you think about it, almost by definition, “generative AI” creates information (and hallucinates, like in the above pic). Traditionally speaking, a “picture is worth a thousand words”, but if you can generate a picture with just a few words of prompt, the information content in it is far less than a thousand words.
In some sense, this reminds me of “logistic regression” once again. By definition (because it is generative), there is insufficient “tying down of coefficients”, because of which the AI inevitably ends up “adding value of its own”, which by definition is random.
So, you will end up getting arbitrary results. ChatGPT often gives you wrong answers to questions. Dall-E and Midjourney and Stable Diffusion will return nonsense images such as the above. Because a “generative AI” needs to create information, by definition, all the coefficients of the model cannot be well calibrated.
And the consequence of this is that however good these AIs get, however much data is used to train them, there will always be an element of randomness to them. There will always be test cases where they give funny results.
This morning, when I got back from the gym, my wife and daughter were playing 20 questions, with my wife having just taught my daughter the game.
Given that this was the first time they were playing, they started with guessing “2 digit numbers”. And when I came in, they were asking questions such as “is this number divisible by 6” etc.
To me this was obviously inefficient. Binary search, I realised in my head, was the way to do this, and I decided this was a good time to teach my daughter binary search.
So for the next game, I volunteered to guess. I started by asking whether the number was greater than the midpoint of the range, kept halving the remaining range with each question, and got to the number in my wife’s mind (74) in exactly 7 guesses (and you might guess that log2(90), rounded up – 90 being the number of 2 digit numbers – is 7).
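For the record, the “process” I was following is essentially this (a sketch, with the function playing both sides – it stands in for me asking and my wife answering):

```r
guess_number <- function(secret, lo = 10, hi = 99) {
  questions <- 0
  while (lo < hi) {
    mid <- (lo + hi) %/% 2
    questions <- questions + 1          # "is the number greater than mid?"
    if (secret > mid) lo <- mid + 1 else hi <- mid
  }
  c(number = lo, questions = questions)
}

guess_number(74)   # finds 74 in 7 questions; ceiling(log2(90)) is 7
```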
And so we moved on. Next, I “kept” 41, and my wife went through a rather random series of guesses (including “is it divisible by 4” fairly early on) to get there in 8 tries. By this time I was feeling massively proud of putting my computer science knowledge to good use in real life.
“See, you keep saying that I’m not a good engineer. See how I’m using skills that I learnt in my engineering to do well in this game”, I exclaimed. My wife didn’t react.
It was finally my daughter’s turn to keep a number in mind, and my turn to guess.
“Is the number greater than …?”
“Is the number greater than …?”
“Is the number greater than 88?”
My wife started grinning. I ignored it and continued with my “process”, and I got to the right answer (99) in 6 tries. “You are stupid and know nothing”, said my wife. “As soon as she said it’s greater than 88, I knew it is 99. You might be good at computer science but I’m good at psychology”.
She had a point. And then I started thinking – basically the binary search method works under the assumption that the numbers are all uniformly distributed. Clearly, my wife had some information superior to mine, which made 99 far more probable than any number between 89 and 98. And so when the answer to “Is the number greater than 88?” turned out to be “yes”, she made an educated guess that it was 99.
And since I’m used to writing algorithms, and teaching dumb computers to solve problems, I used a process that didn’t make use of any educated guesses! And thus took many more steps to get to the answer.
When the numbers don’t follow a uniform distribution, binary search works differently. You don’t start with the middle number – instead, you start with the weighted median of all the numbers! And then go on to the weighted median of whichever side you end up in. And so on and so forth until you find the number in the counterparty’s mind. That is the optimal algo.
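Something like this sketch (the prior here, which loads extra probability on 99, is made up):

```r
guess_with_prior <- function(secret, numbers = 10:99, prior = NULL) {
  if (is.null(prior)) prior <- rep(1, length(numbers))   # uniform = plain binary search
  prior <- prior / sum(prior)
  questions <- 0
  while (length(numbers) > 1) {
    questions <- questions + 1
    idx <- min(which(cumsum(prior) >= 0.5)[1], length(numbers) - 1)
    split <- numbers[idx]                          # weighted median of what's left
    keep <- if (secret > split) numbers > split else numbers <= split
    numbers <- numbers[keep]
    prior <- prior[keep] / sum(prior[keep])
  }
  c(number = numbers, questions = questions)
}

psych_prior <- ifelse(10:99 == 99, 20, 1)   # "it's probably 99"
guess_with_prior(99)                        # uniform prior: 6 questions (7 in the worst case)
guess_with_prior(99, prior = psych_prior)   # gets there in 3
```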
Then again, how do you figure out what the prior distribution of numbers is? For that, I guess knowing some psychology helps.