A Numerical Look at My Own Blog



October 24th, 2020 by Diana Coman

From time to time, I review from various angles anything and everything, my work and habits included. As this blog has made it through more than 10 years since the first articles that I wrote back in September 2009, I recently got quite curious as to what story its accumulated numbers might tell - and how far that story might turn out to be from the more forgiving picture that memory is bound to provide if left to be the sole narrator of things past.

With one quick query to the blog's database, the numbers were easily produced[1], providing in a separate file a neat listing of each article's date and time of publication, title, character count, word count and link count:

select date(post_date) as articleDate,
       time(post_date) as articleTime,
       post_title as articleTitle,
       length(post_content) as charCount,
       (length(post_content) - length(replace(post_content, " ", "")) + 1) as wordCount,
       (length(post_content) - length(replace(post_content, "a href", ""))) / length("a href") as linkCount
from wp_posts
where post_status="publish" and post_type="post"
into outfile '/tmp/blognumbers.txt';
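The counting trick used by the query can be tried outside the database as well. What follows is only a minimal sketch in Python (not the SQL above, nor the R used for the rest of the analysis); the function names and the sample text are made up for illustration.

```python
def count_words(text: str) -> int:
    # Number of spaces = length with spaces minus length without them;
    # adding 1 accounts for the last word, which has no trailing space.
    return len(text) - len(text.replace(" ", "")) + 1


def count_links(text: str, marker: str = "a href") -> int:
    # Deleting every occurrence of the marker shortens the text by
    # (occurrences * len(marker)), so dividing recovers the occurrences.
    return (len(text) - len(text.replace(marker, ""))) // len(marker)


# Hypothetical article body, for illustration only.
sample = (
    'See <a href="http://example.com/a">this</a> and '
    '<a href="http://example.com/b">that</a> too.'
)

print(count_words(sample))  # the space inside "a href" counts as a separator too
print(count_links(sample))
```

Two caveats apply to the approach itself rather than to this sketch only: runs of consecutive spaces inflate the word count, and markup counts as words. Note also that MySQL's length() counts bytes while Python's len() counts characters, so the two can differ on non-ASCII text.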

The file obtained with the above query[2] is quickly[3] loaded into R and then all charts and stats are a few commands away, ready to tell you whatever you want to know and quite a few things you'd rather... not know.
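For those without R at hand, the loading step described in footnote 3 can be approximated as below. This is a sketch in Python under stated assumptions: the column names are taken from the query, the sample line is invented, and fields are assumed to contain no embedded tabs or newlines.

```python
import csv
import io

# Column order as produced by the query above.
COLUMNS = ["articleDate", "articleTime", "articleTitle",
           "charCount", "wordCount", "linkCount"]


def load_rows(handle):
    # QUOTE_NONE keeps apostrophes and double quotes in titles as plain text;
    # the csv module has no comment character, so '#' needs no special care.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        record = dict(zip(COLUMNS, fields))
        record["charCount"] = int(record["charCount"])
        record["wordCount"] = int(record["wordCount"])
        # The SQL division yields a decimal such as 2.0000, hence float first.
        record["linkCount"] = round(float(record["linkCount"]))
        rows.append(record)
    return rows


# Hypothetical line in the same tab-separated layout, for illustration only.
sample_file = io.StringIO("2020-10-24\t10:15:00\tIt's a #test title\t1200\t200\t2.0000\n")
rows = load_rows(sample_file)
print(rows[0]["articleTitle"], rows[0]["wordCount"], rows[0]["linkCount"])
```

With the real file, `load_rows(open('/tmp/blognumbers.txt', newline=''))` would be the equivalent call, assuming the file is readable from where the script runs.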

First of all though, and as always with any data set whatsoever, there's some cleanup to do: in this case, I discarded all the logs articles (both #eulora and #ossasepia) since they are a sort of special category on their own and I'd rather not have them mixed with the rest. As a result - and highlighting already my low blog-output otherwise - the dataset shrinks significantly: from the initial 2334 articles including logs, I'm left with only 636 articles.

For the 636 articles remaining, the summary stats are these:

             Minimum  1st Quartile  Median   Mean  3rd Quartile  Maximum
Word Count        15           354     611   1149          1077    15107
Link Count         0             1       2   4.55             5      238
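A summary of this kind (minimum, quartiles, median, mean, maximum) can be reproduced along the following lines. The sketch is in Python rather than R, the word counts are invented, and the "inclusive" method is chosen on the assumption that R's default quantile definition was used for the table above.

```python
import statistics


def six_number_summary(values):
    # method="inclusive" interpolates the same way as R's default quantile type.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


# Hypothetical word counts, for illustration only (not the blog's data).
word_counts = [15, 200, 354, 500, 611, 900, 1077, 2000, 15107]
summary = six_number_summary(word_counts)
for name, value in summary.items():
    print(name, value)
```

Even in this tiny made-up sample, one very long article is enough to push the mean well above the median, which is the same pattern as in the table.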

The first thing said by the stats above is that my shortest article is 15 words long. That sounds rather surprising to me on first pass, so let's look at what that wonder of an "article" might be, since it's just a matter of one quick select in R: the shortest article is... this one, aka literally the statement of the question "how do you go about finding out that which you don't yet know you don't know?" All I can say is that now I'd make that article even shorter - the title is all that matters really. Other than that, I'd say those few words (15 or fewer, as they might be) still make quite an article in this case indeed: while there's a lot more that I could write in answer to that question, it's still... the question that matters and quite a lot, at that. Not a bad thing to be reminded of, either, as an unexpected gain of number-questioning my own blog.

Moving on, the rest of the numbers say that half my articles are between 354 and 1077 words in length, which is no surprise. Moreover, the fact that the median article is 611 words long while the average article is 1149 words long is again quite as expected - there are some really long articles that pull that average up. Otherwise, it took me a long, long time (and way more effort than the blog itself shows) to get into the habit of expanding rather than forever contracting my writing until there would be almost nothing left to write at all. So I'd expect indeed that the length of articles is perhaps more consistently 1000+ words only over the past few years, but not earlier. It's worth noting also that more words by themselves are just verbosity, so more of a minus than a plus. While I'm focusing on the numbers in this article, it does *not* mean that it's only those numbers that I'm relying on when considering the increase in length of my articles as an improvement of my own writing rather than the opposite of that. At any rate, since I still write relatively few articles anyway, I expect it will take quite some time until this change towards writing consistently longer articles shows more markedly in the overall summary.

The summary of link counts as it stands doesn't say perhaps all that much, except that there is clearly at least one outlier with a high count of links - as indeed there is: a review article of 200+ articles, linking therefore to each and every one of those. Nevertheless, taking out the outliers doesn't change anything other than the maximum value itself (the 2nd highest is 63 links), so I'll let it be as it stands.
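Why dropping a single extreme value leaves the middle of such a summary in place can be seen on a toy example. The sketch below is in Python with invented link counts; only the values 63 and 238 echo the text above.

```python
import statistics

# Hypothetical link counts, for illustration only.
link_counts = [0, 1, 1, 2, 2, 2, 2, 5, 5, 63, 238]

# Drop the single largest value (the "review article" style outlier).
trimmed = sorted(link_counts)[:-1]

median_before = statistics.median(link_counts)
median_after = statistics.median(trimmed)
mean_before = statistics.mean(link_counts)
mean_after = statistics.mean(trimmed)

print("median:", median_before, "->", median_after)
print("mean:", mean_before, "->", mean_after)
print("max:", max(link_counts), "->", max(trimmed))
```

The median depends only on the ordering of the values, so it stays where it was. The mean does move, and in such a tiny sample it moves far more than it would across 636 articles.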

The full graphs for those 636 articles can be plotted as well, to finally have some sort of pictures in this article, too:

blog_stats_2_640.jpg

blog_stats_1_640.jpg

Looking at the graphs above, 2018 seems to mark a clear change, with both word count and links more settled at a higher level than before. At least knowing my own history otherwise, the expected quasi-break around 2013 is visible too - if anything, I'm almost surprised that it doesn't look even worse than that, as I'd have almost expected a nearly total flat-line there. Then again, it's clearly the only place where such a flat-line is actually visible on some intervals at all - if there was a point where the blog could have stopped, it's there, but it turns out it was indeed a temporary break rather than the sort of "temporary" that becomes permanent.

To see more clearly the woods rather than all those individual trees making the charts above, I calculated some aggregated values on yearly intervals as well - after all, why not do it if I already got all the data and all the powerful R just waiting to help. Plotting the number of articles per year removes all doubt at least - there was barely any writing in 2013 and 2014, indeed. The surprise comes at the peak in 2011 and perhaps not as much at it being then as at it being so much higher than anything else:

blog_stats_3_640.jpg

One single number though, no matter how chosen, barely gives at any time anything other than a quick but extremely, extremely narrow (focused, if you prefer) view of anything at all. To add to it a few more similarly narrow but at least differently angled views of it, here's the median word count and link count of articles for each year:

blog_stats_4_640.jpg
blog_stats_5_640.jpg
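The yearly aggregates behind the last three charts (number of articles, median word count and median link count per year) amount to a group-by on the year of publication. Here is a minimal sketch in Python rather than R, with invented rows that reuse the column names of the query.

```python
import statistics
from collections import defaultdict


def yearly_summary(rows):
    # Group rows by the year part of an ISO date such as "2018-03-01".
    by_year = defaultdict(list)
    for row in rows:
        by_year[int(row["articleDate"][:4])].append(row)

    summary = {}
    for year in sorted(by_year):
        items = by_year[year]
        summary[year] = {
            "articles": len(items),
            "medianWords": statistics.median(r["wordCount"] for r in items),
            "medianLinks": statistics.median(r["linkCount"] for r in items),
        }
    return summary


# Hypothetical rows, for illustration only (not the blog's data).
example_rows = [
    {"articleDate": "2011-02-10", "wordCount": 300, "linkCount": 1},
    {"articleDate": "2011-06-05", "wordCount": 500, "linkCount": 3},
    {"articleDate": "2011-11-20", "wordCount": 400, "linkCount": 2},
    {"articleDate": "2018-03-01", "wordCount": 1800, "linkCount": 9},
]
per_year = yearly_summary(example_rows)
for year, stats in per_year.items():
    print(year, stats)
```

Years without any article simply do not appear in the result, so a chart drawn from it would need those gaps filled in with zero explicitly.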

While the overall clear upwards (if slow) trend is quite obvious to me and matching otherwise everything else[4], the peak in 2018 (or dip in 2019, if you prefer) seemed at first rather surprising to me - from my point of view otherwise, there isn't/wasn't anything all that special about either of the years involved. But then looking at the blog's archives, it turns out that the peak in 2018 can be easily traced to... the EuCrypt work (that I didn't otherwise assign mentally to 2018, at all, go figure!). So on one hand there is indeed an overall visible, if slow, improvement but on the other hand it's also rather clear that I still write the most when some project quite requires it as such - not that I'm all that surprised at it or anything.

After those 1500+ words that I could write today quite easily[5] out of a few numbers that I obtained again without much trouble, I'd rather say it plainly also that I don't hold those numbers to mean *by themselves* all that much: they are certainly very useful in their unflinching precision but that very same precision comes at the cost of narrowing the perspective so much as to make it almost a single point, effectively blinding the naive (or the fool, if you prefer, there aren't really all that many different words to describe accurately the one who relies on "just the data") to all the complexity that actually yields that specific point in that specific place. So don't discard the numbers when they are honest measurements for they don't lie but don't expect of them more than they can offer, either: data points, even when aggregated or otherwise put through any intricate analysis are still just that, mere points, not by themselves models of the underlying reality.


  1. And then they sent me off in an entirely unexpected direction, as I got to discover a silently broken article on the blog that I proceeded therefore to fix. 

  2. To count the words, the query basically calculates the number of spaces as the difference between the length of the article as it is and the length of the article without spaces, and adds 1 to that. Counting the links is done in a similar manner, only taking away the length of the article without "a href" and dividing the result by the length of "a href" itself. 

  3. Quickly enough at least: since mysql spits out the values in tab-separated format and my titles contain all sorts of characters, including apostrophes and #, the read.table command of R needed some parameters (sep aka separator, quote and comment.char) set to something other than the default values, but R's help is clear enough and always at hand so the whole thing really took 5 minutes. 

  4. This part is more important than it seems - if you look long and hard enough, you *can* find *anything* in some data. Whether it also means something and moreover whether it means what you think is quite a different matter entirely. 

  5. And this says way more and is more important than all the numbers in this piece taken together. 

2 Responses to “A Numerical Look at My Own Blog”

  1. Mircea Popescu says:

    > The surprise comes at the peak in 2011

    Fain, keks.

  2. Diana Coman says:

    To some extent it is Fain-effect indeed although a significant part there is in fact my Bac-data too, which had nothing to do with Fain really. And to be honest it's still the part that endured more than the Fain flash thing.

    It's worth saying also that the 2011 peak is a peak in terms of number of articles but not that much of a peak if one looks beyond that - even leaving aside the length of articles as such, the whole still looks flimsier to me than the 2018+ "lower" stage. There is also more history to that data though - for one thing it's pretty much 2010-2012 when I wrote otherwise between 500 and 1000 words daily without fail, exactly having had enough of this "can't write" shit. It clearly worked in that I basically got better at writing but it didn't necessarily do much to the other side of it aka "write on blog instead of digging deeper into something interesting".
