A Year in Stack Overflow

teaching

Fri, 21 Nov 2014

365 Consecutive Days On Stack Overflow. Why?

There was a trend a while ago for photographers to announce they were doing ‘a 365’. They meant they were going to take a picture every day, and post it online. Soon their photostreams were full of badly instagrammed food and selfies.

I’ve recently completed a slightly different 365. Every day for the past 365 days I’ve checked the Stack Overflow site. Stack Overflow (henceforth ‘SO’) is a question-and-answer site for programming questions. It’s one of the Stack Exchange collection of question-and-answer sites for asking just about anything. Computing, language, TV and movies, parenting, religion, poker, and bicycles are just a few of the component sites.

At the 2011 R User Meeting I did a lightning talk suggesting we take R programming questions over to SO and leave the R-help mailing list for announcements and more opinionated forms of discussion. Although I can’t claim that to be the cause, traffic on the mailing list has declined and the number of questions tagged R is now over seventy thousand.

I think it was back in about March that I noticed I’d not missed a day on SO since November. A couple of months later I realised again I hadn’t broken that run. By the time I was off to Bergen in Norway for a week in July I was consciously trying to keep it going. Thanks to hotel and airport Wi-Fi I managed to check SO every day. I’ve probably only had a couple of weekends away this past year, so it’s never been a problem. I can check SO on my phone over a data connection if I need to. So I did. Then a couple of days ago the consecutive days counter clicked over to 365 and then to 366.

What does that all mean? Well, for one it indicates how connected we are. The internet was there every day for me for a year. Along with reading and writing on SO I checked my email, tweeted, read the news, and probably did some work too.

But what did I do on SO? I can check. SO has a data explorer where you can build a query and download a CSV file of metadata about your posts. Load that into R and summaries are easy:

I asked a grand total of five questions.
Four of them had answers I felt were acceptable.
I answered 300 questions.
The people who asked the questions accepted 171 of my answers as the solution.
There are 84 questions still waiting for the asker to accept an answer. That might never happen.
The remaining 45 had another answer accepted as the best answer.
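
For anyone wanting to try the same kind of tally, here is a minimal sketch, written in Python rather than R and run on made-up rows, not my real export. The column names (`post_type`, `is_accepted`, `question_has_accepted`) are assumptions for illustration; the actual columns depend on the query you build in the data explorer.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for a data explorer export.
# The column names are assumptions, not the real export schema.
SAMPLE_CSV = """post_type,is_accepted,question_has_accepted
answer,1,1
answer,0,0
answer,0,1
answer,1,1
question,0,1
"""

def classify(row):
    """Label one answer row as accepted, pending, or beaten by another answer."""
    if row["is_accepted"] == "1":
        return "accepted"
    if row["question_has_accepted"] == "1":
        return "other answer accepted"
    return "no answer accepted yet"

def summarise_answers(csv_file):
    """Count answers in each acceptance category, skipping question rows."""
    reader = csv.DictReader(csv_file)
    return Counter(classify(r) for r in reader if r["post_type"] == "answer")

summary = summarise_answers(io.StringIO(SAMPLE_CSV))
for status, n in summary.most_common():
    print(f"{status}: {n}")
```

With a real file, `summarise_answers(open("posts.csv", newline=""))` would do the same job, where `posts.csv` is a hypothetical filename for the downloaded CSV.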

I can also check the tags for the questions I’ve answered. Tags are used to categorise questions on SO, and a question can have several tags assigned. Of those 300 questions I answered:

There were 298 unique tags.
288 questions had the r tag.
10 had the python tag.
15 were tagged ggplot2 and/or plot.
9 had the igraph tag.

Then there’s a long, long tail, with 223 tags making a single appearance, including things like ruby, julia-lang, java, and fortran.
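
Tag counts like these come down to splitting each question’s tag string and counting. Below is a minimal sketch in Python on made-up tag strings. The `<tag1><tag2>` layout is an assumption about how the export stores tags, and `split_tags` is a hypothetical helper name; adjust the pattern if your CSV uses a different separator.

```python
import re
from collections import Counter

# Made-up tag strings, one per answered question. The "<tag1><tag2>" layout
# is an assumption about how the export stores tags.
SAMPLE_TAGS = [
    "<r><ggplot2>",
    "<r><igraph>",
    "<python><plot>",
    "<r><plot><ggplot2>",
    "<r><fortran>",
]

def split_tags(tag_string):
    """Return the individual tag names inside a '<a><b>' style string."""
    return re.findall(r"<([^<>]+)>", tag_string)

# How many questions carry each tag.
tag_counts = Counter(tag for s in SAMPLE_TAGS for tag in split_tags(s))

# Tags that appear exactly once form the long tail.
singletons = sorted(t for t, n in tag_counts.items() if n == 1)

# Questions carrying either of two tags ("and/or"), each counted once.
either = sum(1 for s in SAMPLE_TAGS if {"ggplot2", "plot"} & set(split_tags(s)))

print(len(tag_counts), "unique tags")
print(tag_counts["r"], "questions tagged r")
print(either, "tagged ggplot2 and/or plot")
print("single-appearance tags:", singletons)
```

The and/or count uses a set intersection per question, so a question tagged with both ggplot2 and plot is counted once, not twice.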

As well as asking and answering, I’ve probably made even more comments on questions and answers. The quality of new questions can be appalling, but people soon learn the netiquette. A bunch of us hang out in the R chat room on SO, sending a flurry of downvotes or close-requests to anything that needs improving or removing.

I’ve occasionally popped over to some of the other SE sites. The GIS Stack Exchange is a good place to ask and learn about the latest mapping software. The Data Science site, however, is a bit rubbish, full of inappropriate questions, or things that return opinionated answers, or questions whose answers would be impossible to squeeze into a text box. It’s currently a ‘beta’ site on SE, and I don’t know if it will survive. Its greatest utility at the moment is as a place to dump poor machine-learning questions posted on SO.

I’ve learnt a lot this year. I find SO is a great place to learn, more by answering questions than by asking them. You can also learn by seeing what the other answers are. This year has been one of great debate between the data.table crowd and the dplyr/pipes mob. Some people have been posting answers that give base R, data.table, and dplyr solutions side by side, with benchmarks. The winner in this battle is all of us, since we have many possible solutions for an infinite number of problems.

I don’t know how long this run will last. Now I’ve written this up I could intentionally avoid SO tomorrow. But there might just be one interesting question…