On Sunday evening, we were driving to a relative’s place in Mahalakshmi Layout when I almost missed a turn. And then I was about to miss another turn and my wife said “how bad are you with directions? You don’t even know where to turn!”.
“Well, this is your area”, I told her (she grew up in Rajajinagar). “I had very little clue of this part of town till I married you, so it’s no surprise I don’t know how to go to your cousin’s place”.
“But they moved into this house like six months ago, and every time we’ve gone there together. So if I know the route, why can’t you”, she retorted.
This gave me the trigger to go off on a rant about pre-trained models, and I’m going to inflict that on you now.
For a long time, I didn’t understand what the big deal was about pre-trained machine learning models. “If it’s trained on some other data, how will it even work with my data”, I wondered. And then recently I started using GPT-4 and other similar large language models. And I started reading blogposts on how, with very little fine-tuning, these models can do “gymnastics”.
Having grown up in North Bangalore, my wife has a “pretrained model” of that part of town in her head. This means she has sufficient domain knowledge, even if she doesn’t have any specific knowledge. Now, with a small amount of new specific information (the way to her cousin’s new house, for example), it is easy for her to fit the specific information into her generic knowledge and get a clear idea of how to get there.
(PS: I’m not at all suggesting that my wife’s intelligence is artificial here)
On the other hand, my domain knowledge of North Bangalore is rather weak, despite having lived there for two years. For the longest time, Malleswaram was a Chakravyuha – I would know how to go there, but not how to get back. Given this lack of domain knowledge, the little information on the way to my wife’s cousin’s new house is not sufficient for me to find my way there.
It is similar with machines. LLMs and other pre-trained models have sufficient “generic domain knowledge” of lots of things, thanks to the large amounts of data they’ve been trained on. As a consequence, if you then train them on fairly small samples of specific data, they are able to generalise around that specific data and learn from it.
More pertinently, in real life, depending upon our “generic domain knowledge” of different domains, the amount of information that you and I need in order to learn the same thing about a domain can be very, very different.
I’m writing this five minutes after making my wife’s “coffee decoction” using the Bialetti Moka pot. I don’t like chicory coffee early in the morning, and I’m trying to not have coffee soon after I wake up, so I haven’t made mine yet.
While I was filling the coffee powder into the moka pot, I was thinking of the concept of channelling. Basically, if you try to pack the moka pot too tight with coffee powder, then the steam (that goes through the coffee, thus extracting the caffeine) takes the easy way out – it tries to create a coffee-less channel to pass through, rather than do the hard work of extracting coffee as it passes through the layer of coffee.
I’m talking about steam here – water vapour, to be precise. It is as lifeless as it could get. It is the gaseous form of a colourless odourless shapeless liquid. Yet, it shows the seeming “intelligence” of taking the easy way out. Fundamentally this is just physics.
This is not an isolated case. Last week, at work, I was wondering why some algorithm was returning a “negative cost” (I was using local search, and after a few iterations, I found that the algorithm was rapidly taking the cost – which is supposed to be strictly positive – into deep negative territory). Upon careful investigation (thankfully it didn’t take too long), it transpired that there was a penalty cost which increased non-linearly with some parameter. And the algo had “figured” that if this parameter went really high, the penalty cost would go negative (basically I hadn’t done a good job of defining the penalty). And so it would take this channel.
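If I were to reconstruct the bug now (this is a sketch written for this post, with a made-up functional form – not the actual work code), it would look something like this in R:

```r
# The penalty was meant to grow (non-linearly) with some parameter, but a
# functional form like the made-up one below peaks and then turns negative --
# so any cost-minimising search can push the parameter sky-high and make the
# "strictly positive" cost arbitrarily negative.
penalty <- function(x) 10 * x - 0.1 * x^2   # fine for small x, negative beyond x = 100

x <- c(10, 50, 100, 200, 500)
data.frame(parameter = x, penalty = penalty(x), total_cost = 50 + penalty(x))
# the last couple of rows have deeply negative "costs" -- that is the channel the algo took
```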
Again, this algorithm has none of the supposedly scary “AI” or “ML” in it. It is a good old rule-based system, where I’ve defined all the parameters and only the hard work of finding the optimal solution is left to the algo. And yet, it “channelled”.
Basically, you don’t need to have a good reason for taking the easy way out. It is not even a human, or “animal”, thing to do – it is simply a physical fact. When there exists an easier path, you simply take it – whether you are an “AI” or an algorithm or just steam!
When I’m visiting someone’s house and they have an accessible bookshelf, one of the things I do is to go check out the books they have. There is no particular motivation, but it’s just become a habit. Sometimes it serves as conversation starters (or digressors). Sometimes it helps me understand them better. Most of the time it’s just entertaining.
So at a friend’s party last night, I found this book on Graph Theory. I just asked my hosts whose book it was, got the answer and put it back.
As many of you know, whenever we host a party, we use graph theory to prepare the guest list. My learning from last night’s party, though, is that you should not only use graph theory to decide WHO to invite, but also to adjust the times you tell people so that the party has the best outcome possible for most people.
With the full benefit of hindsight, the social network at last night’s party looked approximately like this. Rather, this is my interpretation of the social network based on my knowledge of people’s affiliation networks.
This is approximate, and I’ve collapsed each family to one dot. Basically it was one very large clique, and two or three other families (I told you this was approximate) who were largely only known to the hosts. We were one of the families that were not part of the large clique.
This was not the first such party I was attending, btw. I remember this other party from 2018 or so which was almost identical in terms of the social network – one very large clique, and then a handful of families only known to the hosts. In fact, as it happens, the large clique from the 2018 party and from yesterday’s party were from the same affiliation network, but that is only a coincidence.
Thinking about it, we ended up rather enjoying ourselves at last night’s party. I remember getting comfortable fairly quickly, and that mood carrying on through the evening. Conversations were mostly fun, and I found myself connecting adequately with most other guests. There was no need to get drunk. As we drove back peacefully in the night, my wife and daughter echoed my sentiments about the party – they had enjoyed themselves as well.
This was in marked contrast with the 2018 party with the largely similar social network structure (and dominant affiliation network). There we had found ourselves rather disconnected, unable to make conversation with anyone. Again, all three of us had felt similarly. So what was different yesterday compared to the 2018 party?
I think it had to do with the order of arrival. Yesterday, we were the second family to arrive at the party, and from a strict affiliation group perspective, the family that had preceded us at the party wasn’t part of the large clique affiliation network (though they knew most of the clique from beforehand). In that sense, we started the party on an equal footing – us, the hosts and this other family, with no subgroup dominating.
The conversation had already started flowing among the adults (the kids were in a separate room) when the next set of guests (some of them from the large clique) arrived, and the assimilation was seamless. Soon everyone else arrived as well.
The point I’m trying to make here is that because the non-large-clique guests had arrived first, they had had a chance to settle into the party before the clique came in, which meant the party never got too cliquey. That worked out brilliantly.
In contrast, in the 2018 party, we had ended up going rather late which meant that the clique was already in action, and a lot of the conversation had been clique-specific. This meant that we had struggled to fit in and never really settled, and just went through the motions and returned.
I’m reminded of another party WE had hosted back in 2012, where there was a large clique and a small clique. The small clique had arrived first, and by the theory in this post, should have assimilated well into the party. However, as the large clique came in, the small clique had sort of ended up withdrawing into itself, and I remember having had to make an effort to balance the conversation between all guests, and it not being particularly stress-free for me.
The difference there was that there were TWO cliques with me as cut-vertex. Yesterday, if you took out the hosts (cut-vertex), you would largely have one large clique and a few isolated nodes. And the isolated nodes coming in first meant they assimilated both with one another and with the party overall, and the party went well!
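For what it’s worth, this is easy to check with the igraph package in R – here is a toy version of last night’s party graph, with all names made up:

```r
# The hosts connect one large clique to a few families who mostly know only the hosts.
library(igraph)

party <- graph_from_literal(
  Hosts - A, Hosts - B, Hosts - C, Hosts - D,   # the hosts know everyone
  A - B, A - C, A - D, B - C, B - D, C - D,     # the large clique
  Hosts - Us, Hosts - F1, Hosts - F2            # the "isolated nodes"
)

articulation_points(party)   # returns just the Hosts: take them out and the graph falls apart
```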
And now that I’ve figured out this principle, I might break my head further at the next party I host – in terms of what times I tell different guests!
On Saturday morning, my daughter had made some nice art with sketch pen on an A4 paper. It was rather “geometric”, consisting of repeating patterns across the page. My wife took one look at it and said, “do you know that you can make such art with computers also? Your father has made some”.
“Really?”, piped the daughter. I had been intending for a while to start teaching her to code (she is six), and figured this was the perfect trigger, and said I will teach her.
A quick search revealed that there is an “ACS Logo” for Mac (Logo was the first “programming language” I had learnt, when I was nine). I quickly downloaded it on her computer (my wife’s old Macbook Air) and figured I remembered most of the commands.
And then I started typing, and showed her what they had shown me back in a “computer class” behind my house in 1992 – FD for “forward”, RT for “right turn”, HT for “hide turtle”. Etc. etc.
Soon she was engrossed in it. Thankfully she has learnt angles in her school, though it took her some trial and error to figure out how much to turn by for different shapes (later I was thinking this can also serve as a good “angles revision” for her during her ongoing summer holidays).
With my wife having reminded me that I could produce images through code, I realised that as my daughter was engrossed in her “coding”, I should do some “coding art” on my own. All she needed was some occasional input, and for me to sit right next to her.
Last Monday I had got a bit of a scare – at work, I needed to generate randomly distributed points in a regular hexagon. A lookup online told me that I could just get a larger number of randomly distributed points in a bounding rectangle, and then only pick points within the hexagon. And then take a random sample of those.
This had meant that I needed to write equations for whether a point lay inside a hexagon. And I realised I’d forgotten ALL my coordinate geometry. It took me over half an hour to get the equation for the sides of the hexagon right – I’m clearly rusty.
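For the record, a minimal sketch of that approach in R looks something like this (the hexagon here is a regular one, centred at the origin with circumradius 1 and a flat top – assumptions for this post, not the actual geometry from work):

```r
in_hexagon <- function(x, y, r = 1) {
  # inside the two horizontal edges, and inside the four slanted edges
  abs(y) <= sqrt(3) / 2 * r & abs(y) <= sqrt(3) * (r - abs(x))
}

random_points_in_hexagon <- function(n, r = 1) {
  pts <- data.frame()
  while (nrow(pts) < n) {
    # sample in the bounding rectangle, keep only what falls inside the hexagon
    x <- runif(2 * n, -r, r)
    y <- runif(2 * n, -sqrt(3) / 2 * r, sqrt(3) / 2 * r)
    keep <- in_hexagon(x, y, r)
    pts <- rbind(pts, data.frame(x = x[keep], y = y[keep]))
  }
  pts[seq_len(n), ]   # a random sample of exactly n points
}

plot(random_points_in_hexagon(2000), asp = 1, pch = ".")
```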
And on Saturday, as I sat down to make some “computer art”, I decided I’ll make some fractals. Why don’t I make some Sierpinski Triangles, I thought. I started breaking down what code I needed to write.
First, given an equilateral triangle, I had to return three similar equilateral triangles, each of half the side length of the original triangle.
Then, given the centroid of an equilateral triangle and the length of each side, I had to return the vertices.
Once these two functions had been written, I could just chain them (running the first one recursively) and then plot the result to get the Sierpinski triangle.
And then I had my second scare of the week – not only had I forgotten my coordinate geometry, I had forgotten my trigonometry as well. Again I messed up a few times, but the good thing about programming with a computer is that I could do trial and error. Soon I had it right, and started producing Sierpinski triangles.
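My reconstruction of those two functions and the recursion (written now for this post, not the original code) looks roughly like this – triangles are represented by centroid and side length, and are assumed to “point up”:

```r
triangle_vertices <- function(cx, cy, side) {
  r <- side / sqrt(3)                      # circumradius of an equilateral triangle
  angles <- c(90, 210, 330) * pi / 180
  data.frame(x = cx + r * cos(angles), y = cy + r * sin(angles))
}

split_triangle <- function(cx, cy, side) {
  # the three half-sized children: their centroids are the midpoints between
  # the parent's centroid and each of its vertices
  v <- triangle_vertices(cx, cy, side)
  data.frame(cx = (cx + v$x) / 2, cy = (cy + v$y) / 2, side = side / 2)
}

sierpinski <- function(depth, cx = 0, cy = 0, side = 1) {
  tris <- data.frame(cx = cx, cy = cy, side = side)
  for (i in seq_len(depth)) {
    tris <- do.call(rbind, Map(split_triangle, tris$cx, tris$cy, tris$side))
  }
  plot(NULL, xlim = c(-0.6, 0.6), ylim = c(-0.4, 0.7), asp = 1,
       axes = FALSE, xlab = "", ylab = "")
  for (i in seq_len(nrow(tris))) {
    v <- triangle_vertices(tris$cx[i], tris$cy[i], tris$side[i])
    polygon(v$x, v$y, col = "grey20", border = NA)
  }
}

sierpinski(5)
```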
Then, there was another problem – my code was really inefficient. If I went beyond depth 4 or 5, the figures would take inordinately long to render. Since I was coding in R, I set about vectorising all my code. In R you don’t write loops if you can help it – instead, you apply functions on entire vectors. This again took some time, and then I had the triangles ready. I proudly showed them off to my daughter.
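The general idea, with a trivial made-up example:

```r
# instead of growing a result inside a loop...
sides <- numeric(0)
for (depth in 1:10) sides <- c(sides, 0.5 ^ depth)

# ...operate on the whole vector at once:
sides <- 0.5 ^ (1:10)
```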
“Appa, why is it that as you increase the number it becomes greyer?”, she asked. I explained how with each step, you were taking away more of the filled areas from the triangles. Then I figured this wasn’t that good-looking – maybe I should colour it.
And so I wrote code to colour the triangles. Basically, I started recursively colouring them – the top third green, left third red and right third blue (starting with a red base). This is what I ended up producing:
And this is what my daughter produced at the same time, using Logo:
As many of the regular readers of this blog know, I largely use R for most of my data work. There have been a few occasions when I’ve tried to use Python, but have found that I’m far less efficient in that than I am with R, and so abandoned it, despite the relative ease of putting things into production.
Now in my company, like in most companies, people use both Python and R (the team that reports to me largely uses R, everyone else largely uses Python). And while till recently I used to claim that I’m multilingual in the sense that I can read Python code fairly competently, of late I’m not sure I am. I find it increasingly difficult to parse and read production-grade Python code.
And now, after some experiments with ChatGPT, and exploring other people’s code, I have an idea of why I’m finding it hard to read production-grade Python code. It has to do with “code density”.
Of late I’ve been experimenting with Spark (finally, in this job I do a lot of “big data” work – something I never had to in my consulting career prior to this). Related to this, I was reading someone’s PySpark code.
And then it hit me – the problem (rather, my problem) with Python is that it is far more verbose than R. The number of characters or lines of code required to do something in Python is far more than what you need in R (especially if you are using the tidyverse family of packages, which I do, including sparklyr for Spark).
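To make the point concrete, here is a trivial, entirely made-up example of the kind of dense tidyverse pipeline I mean – filter, group, aggregate and sort, all within one Reynolds pen length:

```r
library(dplyr)

# made-up toy data
orders <- data.frame(
  city   = c("BLR", "BLR", "BOM", "BOM", "DEL"),
  month  = c("Jan", "Jan", "Jan", "Feb", "Feb"),
  status = c("delivered", "delivered", "returned", "delivered", "delivered"),
  amount = c(100, 250, 80, 300, 120)
)

orders %>%
  filter(status == "delivered") %>%
  group_by(city, month) %>%
  summarise(revenue = sum(amount), n_orders = n(), .groups = "drop") %>%
  arrange(desc(revenue))
```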
Why does the density of code matter? It is to do with aesthetics and modularity and ease of understanding.
Yesterday I was writing some code that I plan to put into production. After a few hours of coding, I looked at the code and felt disgusted with myself – it was a way too long monolithic block of code. It might have been good when I was writing it, but I knew that if I were to revisit it in a week or two, I wouldn’t be able to understand what the hell was happening there.
I’ve never worked as a professional software engineer, but with the amount of coding I’ve done, I’ve worked out what is a “reasonable length for a code block”. It’s like that apocryphal story of Indian public examiners for high school exams who evaluate history answers based on how long they are – “if they were to place an ordinary Reynolds 045 pen vertically on the sheet, the answer should be longer than that for the student to get five marks”.
It’s the reverse here. Approximately speaking, if you were to place a Reynolds pen vertically on screen (at your normal font size), no function (or code block) can be longer than the pen.
This roughly approximates how much the eye can see on one normal MacBook screen (I use a massive external monitor at work, and a less massive, but equally wide, one at home). If you have to keep scrolling up and down to understand the full logic, there is a higher chance you might make mistakes, and it is harder for someone else to understand the code.
Till recently (as in earlier this week) I would crib like crazy that people coding in Python would make their code “too modular”. That I would have to keep switching back and forth between different functions (and classes!!) to make sense of some logic, and about how that would make the code hard to debug (I still think there is a limit to how modular you can make your ETL code).
Now, however (I’m writing this on a Saturday – I’m not working today), from the code density perspective, it all makes sense to me.
The advantage of R is that because the code is far denser, you can pack in a far greater amount of logic in a Reynolds pen length of code. So over the years I’ve gotten used to having this much logic being presented to me in one chunk (without having to scroll or switch functions).
The relatively lower density of Python, however, means that the same amount of logic that would be one function in R is now split across several different functions. It is not that the writer of the code is “making this too modular” or “writing functions just for the heck of it”. It is just that their “mental Reynolds pen” doesn’t allow them to pack in more lines in a chunk or function, and Python’s lower density means there is only so much logic that can go in there.
As part of my undergrad, I did a course on Software Engineering (and the one thing the professor told us then was that we should NOT take up software engineering as a career – “it’s a boring job”, he had said). In that, one of the things we learnt was that in conventional software services contexts, billing would happen as a (nonlinear) function of “kilo lines of code” (this was in early 2003).
Now, looking back, one thing I can say is that the rate per kilo line of R code ought to be much higher than the rate per kilo line of Python code.
Back in 2000, I entered the Computer Science undergrad program at IIT Madras thinking I was a fairly competent coder. In my high school, I had a pretty good reputation in terms of my programming skills and had built a whole bunch of games.
By the time half the course was done I had completely fallen out of love with programming, deciding a career in Computer Science was not for me. I even ignored Kama (current diro)’s advice and went on to do an MBA.
He had given me a 3 hour lecture about how I'm wasting my life when I happened to tell him that I was preparing for CAT. https://t.co/1rWG0BnymW
What had happened? Basically it was a sudden increase in the steepness of the learning curve. Or that I’m a massive sucker for user experience, which the Computer Science program didn’t care for.
Back in school, my IDE of choice (rather the only one available) was TurboC, a DOS-based program. You would write your code, and then hit Ctrl+F9 to run the program. And it would just run. I didn’t have to deal with any technical issues. Looking back, we had built some fairly complex programs just using TurboC.
And then I went to IIT and found that there was no TurboC, no DOS. Most computers there had an ancient version of Unix (or worse, Solaris). These didn’t come with nice IDEs such as TurboC. Instead, you had to use vi (some of the computers were so old they didn’t even have vim) to write the code, and then compile it from outside.
Difficulties in coming to terms with vi meant that my speed of typing dropped. I couldn’t “code at the speed of thought” any more. This was the first give up moment.
Then, I discovered that C++ had now got this new “Standard Template Library” (STL), with vectors and stuff. This was very alien to the way I had learnt C++ in school. Also I found that some of my classmates were very proficient with it, and I just couldn’t keep up. The effort seemed too much (and the general workload of the program was so high that I couldn’t get much time for “learning by myself”), so I gave up once again.
Next, I figured that a lot of my professors were suckers for graphic UIs (though they denied us good UX by denying us good computers). This, circa 2001-2, meant programming in Java and writing applets. It was a massive degree of complexity (and “boringness”) compared to the crisp C/C++ code I was used to writing. I gave up yet again.
I wasn’t done with giving up yet. Beyond all of this, there was “systems programming”. You had to write some network layers and stuff. You had to go deep into the workings of the computer system to get your code to run. This came rather intuitively to most of my engineering-minded classmates. It didn’t to me (programming in C was the “deepest” I could grok). And I gave up even more.
I did my B.Tech. project in “theoretical computer science”, managed to graduate and went on to do an MBA. Just before my MBA, I was helping my father with some work, and he figured I sucked at Excel. “What is the use of completing a B.Tech. in computer science if you can’t even do simple Excel stuff?”, he thundered.
In IIMB, all of us bought computers with pirated Windows and Office. I started using Excel. It was an absolute joy. It was a decade before I started using Apple products, but the UX of Windows was such a massive upgrade compared to what I’d seen in my more technical life.
In my first job (where I didn’t last long) I learnt the absolute joy of Visual Basic macros for Excel. This was another level of unlock. I did some insane gymnastics in that. I pissed off a lot of people in my second job by replicating what they thought was a complex model on an Excel sheet. In my third job, I replaced a guy on my team with an Excel macro. My programming mojo was back.
Goldman Sachs’s SLANG was even better. By the time I left there, I had learnt R as well. And then I became a “data scientist”. People asked me to use Python. I struggled with it. After the user experience of R, this was too complex. This brought back bad memories of all the systems programming and dealing with bad UX I had encountered in my undergrad. This time I was in control (I was a freelancer) so I didn’t need to give up – I was able to get all my work done in R.
The second giving up
I’ve happily used R for most of my data work in the last decade. Recently at work I started using Databricks (still write my code in R there, using sparklyr), and I’m quite liking that as well. However, in the last 3-4 months there has been a lot of developments in “AI”, which I’ve wanted to explore.
The unfortunate thing is that most of this is available only in Python. And the bad UX problem is back again.
Initially I got excited, and managed to install Stable Diffusion on my personal Mac. I started writing some OpenAI code as well (largely using R). I started tracking developments in artificial intelligence, and trying some of them out.
And now, in the last 2-3 weeks, I’ve been struggling with “virtual environments”. Each newfangled open-source AI that is released comes with its own codebase and package requirements. They are all mutually incompatible. You install one package, and you break another package.
The “solution” to this, from what I could gather, is to use virtual environments – basically a sandbox for each of these things that I’ve downloaded. That, I find, is grossly inadequate. One of the points of using open source software is to experiment with connecting up two or more of them. And if each needs to be in its own sandbox, how is one supposed to do this? And how are all other data scientists and software engineers okay with this?
This whole virtual environment mess means that I’m giving up on programming once again. I won’t fully give up – I’ll continue to use R for most of my data work (including my job), but I’m on the verge of giving up when it comes to these “complex AI” things.
It’s the UX thing all over again. I simply can’t handle bad UX. Maybe it’s my ADHD. But when something is hard to use, I simply don’t want to use it.
And so I’m giving up once again. Tagore was right.
Back when I was a student, there was this (rather large) species of students whom we used to call “muggoos”. They were called that because they had a habit of “mugging up the answers” – basically they would learn verbatim the stuff in the textbooks and other reading material, and then just spit it out during the exams.
They were incredibly hardworking, of course – since the volume of stuff to mug was immense – and they would make up for their general lack of understanding of the concepts with their massive memories and rote learning.
On average, they did rather well – with all that mugging, the downside was floored. However, they would stumble badly in case of any “open book exams” (where we would be allowed to carry textbooks into the exams) – since the value of mugging there was severely limited. I remember having an argument once with some topper-type muggoos (with generally much better grades than me) on whether to keep exams in a particular course open book or closed book. They all wanted closed book, of course.
This morning, I happened to remember this species while chatting with a friend. He was sending me some screenshots from ChatGPT and was marvelling at something which it supposedly made up (I remembered it as a popular meme from 4-5 years back). I immediately responded that ChatGPT was simply “overfitting” in this case.
Since this was a rather popular online meme, and a lot of tweets would have been part of ChatGPT’s training data, coming up with this “meme-y joke” was basically the algorithm remembering this exact pattern that occurred multiple times in the training set. There was no need to intuit or interpolate or hallucinate – the number of occurrences in the training set meant this was an “obvious joke”.
In that sense, muggoos are like badly trained pieces of artificial intelligence (well, I might argue that their intelligence IS artificial) – they haven’t learnt the concepts, so they are unable to be creative or hallucinate. However, they have been “trained” very very well on the stuff that is there in the textbooks (and other reading material) – and the moment they see part of that it’s easy for them to “complete the sentences”. So when questions in the exams come straight out of the reading materials (as they do in a LOT of Indian universities and school boards) they find it easy to answer.
However, when tested on “concepts”, they now need to intuit – and infer based on their understanding. In that sense, they are like badly trained machine learning models.
One of the biggest pitfalls in machine learning is “overfitting” – where you build a model that is so optimised to the training data that it learns quirks of the data that you don’t want it to learn. It performs superbly on the training dataset. Now, when faced with an unknown (“out of syllabus”) test set, it underperforms like crazy. In machine learning, we use techniques such as cross validation to make sure algorithms don’t overfit.
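For those unfamiliar, a minimal sketch of k-fold cross validation (with made-up toy data) looks like this – fit on k-1 folds, test on the held-out fold, and average, so a model that has merely memorised its training data gets found out on the “out of syllabus” folds:

```r
set.seed(42)
df <- data.frame(x = runif(200))
df$y <- 3 * df$x + rnorm(200, sd = 0.3)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))

cv_rmse <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit <- lm(y ~ poly(x, 15), data = train)   # a deliberately over-flexible model
  sqrt(mean((predict(fit, newdata = test) - test$y)^2))
})

mean(cv_rmse)   # held-out error; compare with the flattering in-sample error
```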
That, however, is not how the conventional Indian education system trains you – throughout most of your education, you find that the “test set” is a subset of the “training set” (questions in examinations come straight out of the textbook). Consequently, people with the ability to mug find that it is a winning strategy to just “overfit” and learn the textbooks verbatim – the likelihood of being caught out by unseen test data is minimal.
And then IF they get out into the real world, they find that a lot of the “test data” is unknown, and having not learnt to truly learn from the data, they struggle.
Last evening we hosted a party at home. Like all parties we host, we used Graph Theory to plan this one. This time, however, we used graph theory in a very different way to how we normally use it – our intent was to avoid large cliques. And, looking back, I think it worked.
First, some back story. For some 3-4 months now we’ve been planning to have a party at home. There has been no real occasion accompanying it – we’ve just wanted to have a party for the heck of it, and to meet a few people.
The moment we started planning, my wife declared “you are the relatively more extrovert among the two of us, so organising this is your responsibility”. I duly put NED. She even wrote a newsletter about it.
The gamechanger was this podcast episode I listened to last month.
The episode, like a lot of podcast episodes, is related to this book that the guest has written. Something went off in my head as I listened to this episode on my way to work one day.
The biggest “bingo” moment was that this was going to be a strictly 2-hour party (well, we did 2.5 hours last night). In other words, “limited liability”!!
One of my biggest issues about having parties at my house is that sometimes guests tend to linger on, and there is no “defined end time”. For someone with limited social skills, this can be far more important than you think.
The next bingo was that this would be a “cocktail” party (meaning, no main course food). Again that massively brought down the cost of hosting – no planning menus, no messy food that would make the floor dirty, no hassles of cleaning up, and (most importantly) you could stick to your 2 / 2.5 hour limit without any “blockers”.
Listen to the whole episode. There are other tips and tricks, some of which I had internalised ahead of yesterday’s party. And then came the matter of the guest list.
I’ve always used graph theory (coincidentally my favourite subject from my undergrad) while planning parties. Typical use cases have been to ensure that the graph is connected (everyone knows at least one other person, and no group of guests is cut off from the rest) and that there are no “cut vertices” (you don’t want the graph to get disconnected if one person doesn’t turn up).
This time we used it in another way – we wanted the graph to be connected but not too connected! The idea was that if there are small groups of guests who know each other too well, then they will spend the entirety of the party hanging out with each other, and not add value to the rest of the group.
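In igraph terms (this is a made-up candidate guest list, not our actual one), the checks we informally run look something like this:

```r
library(igraph)

guests <- graph_from_literal(
  Us - A, Us - B, A - B,        # people we know, who also know each other
  B - C, C - D, D - E, C - E,   # a cluster from one affiliation network
  Us - F                        # someone only we know
)

is_connected(guests)          # nobody is left stranded
articulation_points(guests)   # whose absence would disconnect the party?
clique_num(guests)            # size of the largest clique -- we want this small
```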
Related to this was the fact that we had pre-decided that this party is not going to be a one-off, and we will host regularly. This made it easier to leave out people – we could always invite them the next time. Again, it is important that the party was “occasion-less” – if it is a birthday party or graduation party or wedding party or some such, people might feel offended that you left them out. Here, because we know we are going to do this regularly, we know “everyone’s number will come sometime”.
I remember the day we made the guest list. “If we invite X and Y, we cannot invite Z since she knows both X and Y too well”. “OK let’s leave out Z then”. “Take this guy’s name off the list, else there will be too many people from this hostel”. “I’ve met these two together several times, so we can call exactly one of them”. And so on.
With the benefit of hindsight, it went well. Everyone who said they will turn up turned up. There were fourteen adults (including us), which meant that there were at least three groups of conversation at any point in time – the “anti two pizza rule” I’ve written about. So a lot of people spoke to a lot of other people, and it was easy to move across groups.
I had promised to serve wine and kODbaLe, and kept it – kODbaLe is a fantastic party food in that it is large enough that you don’t eat too many in the course of an evening, and it doesn’t mess up your fingers. So no need of plates, and very little use of tissues. The wine was served in paper cups.
I wasn’t very good at keeping up timelines – maybe I drank too much wine. The party was supposed to end at 7:30, but it was 7:45 when I banged a spoon on a plate to get everyone’s attention and inform them that the party was over. In another ten minutes, everyone had left.
For a long time I have had this shibboleth for whether someone is a “statistics person” or a “machine learning person”. It is based on what they call regressions where the dependent variable is binary. Statisticians simply call it “logit” (there is also a “probit”), while machine learning people call it “logistic regression”.
Now, in terms of implementation as well, there is one big difference between the way “logit” is modelled versus “logistic regression”. For a logit model (if you are using Python, you need to use the “statsmodels” package for this, not scikit-learn), the number of observations needs to far exceed the number of independent variables.
Else, a matrix that needs to be inverted as part of the solution will turn out to be singular, and there will be no solution. I guess I betrayed my greater background in statistics than in Machine Learning when, in 2018, I wrote this blogpost on machine learning being a “process to tie down coefficients in maths models“.
“Logistic regression” (as opposed to “logit”), on the other hand, puts no such constraint on the regression matrix being invertible. Instead of actually inverting any matrix, machine learning approaches simply learn the coefficients using gradient descent (basically the opposite of hill climbing), so mathematical inconveniences such as matrices that cannot be inverted are moot there.
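To illustrate (a toy sketch with made-up data, not anyone’s production code), fitting a logistic regression by gradient descent in R needs no matrix inversion at all:

```r
set.seed(1)
n <- 500
X <- cbind(1, matrix(rnorm(n * 2), ncol = 2))   # intercept + two features
true_beta <- c(-0.5, 2, -1)
y <- rbinom(n, 1, plogis(X %*% true_beta))

beta <- rep(0, ncol(X))
lr <- 0.1
for (i in 1:5000) {
  p <- plogis(X %*% beta)              # predicted probabilities
  grad <- t(X) %*% (p - y) / n         # gradient of the average log-loss
  beta <- beta - lr * grad             # descend (the opposite of hill climbing)
}
round(beta, 2)   # close to true_beta, and no matrix had to be inverted
```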
And so you have logistic regression models with thousands of variables, often calibrated with fewer data points than variables. To be honest, I can’t understand this fully – without sufficient information (data points) to calibrate the coefficients, there will always be a sense of randomness in the output. The model has too many degrees of freedom, and so there is additional information the model is supplying (apart from what was supplied in the training data!).
Of late I have been playing a fair bit with generative AI (primarily ChatGPT and Stable Diffusion). The other day, my daughter and I were alone in my in-laws’ house, and I told her “look I’ve brought my personal laptop along, if you want we can play with it”. And she had demanded that she “play with stable diffusion”. This is the image she got for “tiger chasing deer”.
I have written earlier here about how the likes of ChatGPT and Stable Diffusion in a way redefine “information content“.
What the likes of Dall-E and Stable Diffusion have clearly shown is that a picture is worth far less than a thousand words
And if you think about it, almost by definition, “generative AI” creates information (and hallucinates, like in the above pic). Traditionally speaking, a “picture is worth a thousand words”, but if you can generate a picture with just a few words of prompt, the information content in it is far less than a thousand words.
In some sense, this reminds me of “logistic regression” once again. By definition (because it is generative), there is insufficient “tying down of coefficients”, because of which the AI inevitably ends up “adding value of its own”, which by definition is random.
So, you will end up getting arbitrary results. ChatGPT often gives you wrong answers to questions. Dall-E and Midjourney and Stable Diffusion will return nonsense images such as the above. Because a “generative AI” needs to create information, by definition, all the coefficients of the model cannot be well calibrated.
And the consequence of this is that however good these AIs get, however much data is used to train them, there will always be an element of randomness to them. There will always be test cases where they give funny results.
This morning, when I got back from the gym, my wife and daughter were playing 20 questions, with my wife having just taught my daughter the game.
Given that this was the first time they were playing, they started with guessing “2 digit numbers”. And when I came in, they were asking questions such as “is this number divisible by 6” etc.
To me this was obviously inefficient. Binary search, I realised in my head, was the way to do this, and I decided this was a good time to teach my daughter binary search.
So for the next game, I volunteered to guess. I started by asking whether the number was greater than the midpoint of the range, kept halving the remaining range with each question, and got to the number in my wife’s mind (74) in exactly 7 guesses (and you might guess that log2(90), rounded up – 90 being the number of 2 digit numbers – is 7).
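For the record, the “process” I was following is essentially this (a sketch, with the function playing both sides – it stands in for me asking and my wife answering):

```r
guess_number <- function(secret, lo = 10, hi = 99) {
  questions <- 0
  while (lo < hi) {
    mid <- (lo + hi) %/% 2
    questions <- questions + 1          # "is the number greater than mid?"
    if (secret > mid) lo <- mid + 1 else hi <- mid
  }
  c(number = lo, questions = questions)
}

guess_number(74)   # finds 74 in 7 questions; ceiling(log2(90)) is 7
```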
And so we moved on. Next, I “kept” 41, and my wife went through a rather random series of guesses (including “is it divisible by 4” fairly early on) to get there in 8 tries. By this time I was feeling massively proud of putting my computer science knowledge to good use in real life.
“See, you keep saying that I’m not a good engineer. See how I’m using skills that I learnt in my engineering to do well in this game”, I exclaimed. My wife didn’t react.
It was finally my daughter’s turn to keep a number in mind, and my turn to guess.
“Is the number greater than …?”
“Is the number greater than …?”
“Is the number greater than 88?”
My wife started grinning. I ignored it and continued with my “process”, and I got to the right answer (99) in 6 tries. “You are stupid and know nothing”, said my wife. “As soon as she said it’s greater than 88, I knew it is 99. You might be good at computer science but I’m good at psychology”.
She had a point. And then I started thinking – basically the binary search method works under the assumption that the numbers are all uniformly distributed. Clearly, my wife had some information superior to mine, which made 99 far more probable than any number between 89 and 98. And so when the answer to “Is the number greater than 88?” turned out to be “yes”, she made an educated guess that it was 99.
And since I’m used to writing algorithms, and teaching dumb computers to solve problems, I used a process that didn’t make use of any educated guesses! And thus took many more steps to get to the answer.
When the numbers don’t follow a uniform distribution, binary search works differently. You don’t start with the middle number – instead, you start with the weighted median of all the numbers! And then go on to the weighted median of whichever side you end up in. And so on and so forth until you find the number in the counterparty’s mind. That is the optimal algo.
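Something like this sketch (the prior here, which loads extra probability on 99, is made up):

```r
guess_with_prior <- function(secret, numbers = 10:99, prior = NULL) {
  if (is.null(prior)) prior <- rep(1, length(numbers))   # uniform = plain binary search
  prior <- prior / sum(prior)
  questions <- 0
  while (length(numbers) > 1) {
    questions <- questions + 1
    idx <- min(which(cumsum(prior) >= 0.5)[1], length(numbers) - 1)
    split <- numbers[idx]                          # weighted median of what's left
    keep <- if (secret > split) numbers > split else numbers <= split
    numbers <- numbers[keep]
    prior <- prior[keep] / sum(prior[keep])
  }
  c(number = numbers, questions = questions)
}

psych_prior <- ifelse(10:99 == 99, 20, 1)   # "it's probably 99"
guess_with_prior(99)                        # uniform prior: 6 questions (7 in the worst case)
guess_with_prior(99, prior = psych_prior)   # gets there in 3
```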
Then again, how do you figure out what the prior distribution of numbers is? For that, I guess knowing some psychology helps.