Over the weekend, I read Ben Blatt’s Nabokov’s Favourite Word Is Mauve, a simple natural language processing based analysis of hundreds of popular authors and their books. In this, Blatt uses several measures of goodness or badness of writing, and then measures different authors by it.
So he finds, for example, that Danielle Steel opens a lot of her books by talking about the weather, or that Charles Dickens uses a lot of “anaphora” (anyone who remembers the opening of A Tale of Two Cities shouldn’t be surprised by that). He also talks about the use of simple word counts to detect authorship of unknown documents (a separate post will come on that soon).
As someone who has already written a book (albeit nonfiction), I found a lot of this rather interesting, and constantly found myself trying to evaluate myself on the metrics with which Blatt subjected the famous authors to. And one metric that I found especially interesting was the “Flesch-Kincaid grade level“, which is a measure of complexity of language in a work.
It is a fairly simple formula, based on a linear combination of the average number of words per sentence and the average number of syllables per word. The formula goes like this:
And the result of the formula tells the approximate school grade of a reader who will be able to understand your writing. As you see, it is not a complex formula, and the shorter your sentences and shorter your words (measured in syllables), the simpler your prose is supposed to be.
The simplest works by this metric as mentioned in Blatt’s book are the works of Dr. Seuss, such as The Cat in the Hat or Green Eggs and Spam, on account of the exclusive usage of a small set of words in both books (Dr Seuss wrote the latter as a challenge, not unlike the challenges we would pose each other during “class participation” in business school). These books have a negative grade score, technically indicating that even a nursery kid should be able to read them, but actually meaning they’re simply easy to read.
Since the Flesch Kincaid Grade Score is based on a simple set of parameters (word count, sentence count and syllable count), it was rather simple for me to implement that on the posts from this blog.
I downloaded an XML export of all posts (I took this dump some two or three weeks ago), and then used R, with the Tidytext package to analyse the posts. Word count was most straightforward, using the str_count function in the stringr package (part of the tidyverse). Sentence count was a bit more complicated – there were no ready algorithms. I instead just searched for “sentence enders” (., ?, !, etc. I know the use of . in abbreviations creates problems but I can live with that).
Syllable count was the hardest. Again, there are some packages but it’s incredibly hard to use. Finally after much searching, I came across some code that again approximates this and used it.
Now that the technical stuff is done with, let’s get to the content. This word count, sentence count and syllable count all flow into calculating the Flesch-Kincaid (FK) score, which is the approximate class that one needs to be in to understand the text. Let’s just plot the FK score for all my blog posts (a total of 2341 of them) against time. I’ve added a regression line for good effect.
The trend is pretty clear. Over time, this blog has become more complicated and harder to read. In fact, drawing this graph slightly differently gives another message. This time, instead of a regression line, I’ve drawn a curve showing the trend.
When I started writing in 2004, I was at a 5th standard level. This increased steadily for the first two years (I gained a lot of my steady readership in this time) to get to about 8th standard, and plateaued there for a bit. And then again around 2009-10 there was n increase, as my blog got up to the 10th standard level. It’s pretty much stayed there ever since, apart from a tiny bump up in the end of 2014.
I don’t know if this increase in “complexity” of my blog is a good or a bad thing. On the one hand, it shows growing up. On the other, it’s becoming tougher to read, which has probably coincided with a plateauing (or even a drop) in the readership as well.
Let me know what you think – if you prefer this “grown up style”, or if you want to go back to the more simple writing I started off with.
Where is the concept of topic in the formula? I would think while syllables and words represent complexity of language, there should be a variable to represent topic / subject and their understandability and interest level for readers. No?