Stable Diffusion and ChatGPT and Logistic Regression

For a long time I have had this shibboleth for telling whether someone is a “statistics person” or a “machine learning person”. It is based on what they call a regression where the dependent variable is binary. Statisticians simply call it “logit” (there is also a “probit”), while machine learning people call it “logistic regression”.

Now, in terms of implementation as well, there is one big difference between the way a “logit” is modelled and the way a “logistic regression” is. For a logit model (if you are using Python, you need to use the “statsmodels” package for this, not scikit-learn), the number of observations needs to far exceed the number of independent variables.

Otherwise, the matrix that needs to be inverted as part of the maximum likelihood estimation turns out to be singular, and there is no solution. I guess I betrayed my greater background in statistics than in Machine Learning when, in 2018, I wrote this blogpost on machine learning being a “process to tie down coefficients in maths models”.
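Here is a minimal sketch of what goes wrong, on entirely made-up random data (assuming numpy and statsmodels are installed; the exact failure mode varies a little by statsmodels version):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Far more independent variables (50) than observations (20)
n_obs, n_vars = 20, 50
X = sm.add_constant(rng.normal(size=(n_obs, n_vars)))
y = rng.integers(0, 2, size=n_obs)

try:
    # Maximum likelihood fitting (Newton-type) has to invert a matrix
    # that is singular when variables outnumber observations
    result = sm.Logit(y, X).fit()
    print(result.params)  # if it gets here at all, the estimates are meaningless
except Exception as e:
    # Depending on the version, this surfaces as a singular-matrix
    # or perfect-separation error
    print("Logit fit failed:", e)
```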

“Logistic regression” (as opposed to “logit”), on the other hand, puts no such constraint on the matrix being invertible. Instead of inverting any matrix, machine learning approaches simply learn the coefficients directly using gradient descent (basically the opposite of hill climbing), so mathematical inconveniences such as matrices that cannot be inverted are moot there.
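For contrast, here is a minimal sketch of the gradient descent view, on the same shape of made-up data: no matrix is inverted anywhere, so the fit goes through even with more coefficients than data points (scikit-learn’s LogisticRegression would also fit this happily, with fancier solvers and regularisation on top):

```python
import numpy as np

rng = np.random.default_rng(42)

# Same shape of problem: more coefficients than data points
n_obs, n_vars = 20, 50
X = rng.normal(size=(n_obs, n_vars))
y = rng.integers(0, 2, size=n_obs)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain gradient descent on the log-loss: no matrix inversion anywhere
w = np.zeros(n_vars)
learning_rate = 0.1
for _ in range(1000):
    p = sigmoid(X @ w)               # current predicted probabilities
    gradient = X.T @ (p - y) / n_obs
    w -= learning_rate * gradient

print(w[:5])  # converges to *some* coefficients, even though n < p
```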

And so you have logistic regression models with thousands of variables, often calibrated with fewer data points than variables. To be honest, I can’t understand this fully – without sufficient information (data points) to calibrate the coefficients, there will always be a sense of randomness in the output. The model has too many degrees of freedom, and so there is additional information the model is supplying (apart from what was supplied in the training data!).

Of late I have been playing a fair bit with generative AI (primarily ChatGPT and Stable Diffusion). The other day, my daughter and I were alone in my in-laws’ house, and I told her “look, I’ve brought my personal laptop along, if you want we can play with it”. She demanded that she “play with stable diffusion”. This is the image she got for “tiger chasing deer”.

I have written earlier here about how the likes of ChatGPT and Stable Diffusion in a way redefine “information content“.


And if you think about it, almost by definition, “generative AI” creates information (and hallucinates, like in the above pic). Traditionally speaking, a “picture is worth a thousand words”, but if you can generate a picture with just a few words of prompt, the information content in it is far less than a thousand words.

In some sense, this reminds me of “logistic regression” once again. By definition (because it is generative), there is insufficient “tying down of coefficients”, because of which the AI inevitably ends up “adding value of its own”, which by definition is random.

So, you will end up getting arbitrary results. ChatGPT often gives you wrong answers to questions. DALL-E and Midjourney and Stable Diffusion will return nonsense images such as the above. Because a “generative AI” needs, by definition, to create information, all the coefficients of the model cannot be well calibrated.

And the consequence of this is that however good these AIs get, however much data is used to train them, there will always be an element of randomness to them. There will always be test cases where they give funny results.

No, AGI is not here yet.

Why AI will always be biased

Out on Marginal Revolution, Alex Tabarrok has an excellent post on why “sexism and racism will never diminish“, even when people on the whole become less sexist and racist. The basic idea is that there is always a frontier – even when we all become less sexist or racist, there will be people who will be more sexist or racist than the others, and they will get called out as extremists.

To quote a paper that Tabarrok has quoted (I would’ve used a double block-quote for this if WordPress allowed it):

…When blue dots became rare, purple dots began to look blue; when threatening faces became rare, neutral faces began to appear threatening; and when unethical research proposals became rare, ambiguous research proposals began to seem unethical. This happened even when the change in the prevalence of instances was abrupt, even when participants were explicitly told that the prevalence of instances would change, and even when participants were instructed and paid to ignore these changes.

Elsewhere, Kaiser Fung has a nice post on some of his learnings from a recent conference on Artificial Intelligence that he attended. The entire post is good, and I’ll probably comment on it in detail in my next newsletter, but there is one part that reminded me of Tabarrok’s post – on bias in AI.

Quoting Fung (no, this is not a two-level quote; it’s from his blog post):

Another moment of the day is when one speaker turned to the conference organizer and said “It’s become obvious that we need to have a bias seminar. Have a single day focused on talking about bias in AI.” That was his reaction to yet another question from the audience about “how to eliminate bias from AI”.

As a statistician, I was curious to hear of the earnest belief that bias can be eliminated from AI. Food for thought: let’s say an algorithm is found to use race as a predictor and therefore it is racially biased. On discovering this bias, you remove the race data from the equation. But if you look at the differential impact on racial groups, it will still exhibit bias. That’s because most useful variables – like income, education, occupation, religion, what you do, who you know – are correlated with race.

This is exactly like what Tabarrok mentioned about humans being extremist in whatever way. You take out the most obvious biases, and the next level of biases will stand out. And so on ad infinitum.
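Fung’s point about correlated variables is easy to see in a toy simulation (entirely made-up data and variable names, just for illustration): drop the group variable, train only on a correlated proxy like income, and the model’s outcomes still differ across groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: a binary group attribute, and an income variable
# that is correlated with group membership
group = rng.integers(0, 2, size=n)
income = rng.normal(loc=group, scale=1.0)
outcome = (income + rng.normal(size=n) > 0.5).astype(int)

# "Remove the race data from the equation": train only on income
model = LogisticRegression().fit(income.reshape(-1, 1), outcome)
preds = model.predict(income.reshape(-1, 1))

# The model never saw `group`, yet its positive-prediction rates still differ
print("positive rate, group 0:", preds[group == 0].mean())
print("positive rate, group 1:", preds[group == 1].mean())
```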