Understanding how code reviews are done seems to be one of my main interests recently, partly because reviews are important for software quality even though they take time, and partly because I find it interesting to check whether we can quantify a good opinion from a good software developer.
In this paper, the authors study to what degree one can predict how many comments a given patch will receive. Now, this problem may not be the most exciting one, but it caught my attention because the authors studied the same projects as we did. However, in contrast to our work, they also take into consideration features that characterize software developer networks, for example a developer's experience in commenting on patches or their networking.
Now, to the results. The models presented in this paper seem to be quite good at predicting comments on patches within the same project: all kinds of predictions have pretty good F1-scores, above 65%. This means that we can train these models on our own projects and predict whether a particular patch will be commented on once or twice, or even many times.
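For readers who want to check such numbers on their own project, here is a minimal sketch of how a per-class F1-score can be computed. The class names ("none", "few", "many") and the toy labels are my own assumptions for illustration, not the paper's data or setup.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from two equal-length label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this class
        else:
            fp[pred] += 1       # predicted class was wrong
            fn[truth] += 1      # true class was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2TP / (2TP + FP + FN), defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Hypothetical labels: how often each patch was commented on.
y_true = ["none", "few", "few", "many", "none", "many"]
y_pred = ["none", "few", "many", "many", "none", "few"]

scores = f1_per_class(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)  # unweighted mean over classes
print(scores, round(macro_f1, 3))
```

Looking at the score per class, and not only the average, shows quickly whether a model is good at the easy question (commented on or not) but weak at the hard one (commented on many times).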
The performance of the models on the cross-project dataset is a different story. There, the performance is OK for predicting whether a particular patch will be commented on at all. Predicting how many times the patch will be commented on, or even that it will be commented on many times, does not work very well. The performance measures oscillate close to the 0% mark, which means that the models are no better than guessing. I guess you cannot have it all… from one model.
To sum up, the reason I read this article in more detail than others was not really the performance, but trying to understand the underlying techniques the authors use. I’d say that they use a good set of features, which I recommend others to use (and will definitely use myself in the next studies), and I like that they use simple language models, like word2vec, to understand the programming language. What I miss, though, is an analysis of whether there is a statistically significant dependency between the sentiment (or even its strength) and the length of the discussion.
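The kind of check I have in mind could look roughly like the sketch below: a Spearman rank correlation between a sentiment score and the number of comments, with a simple permutation test for significance. This is my own illustration, not something the authors did, and the sentiment scores and comment counts are made up.

```python
import random

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_p_value(x, y, rounds=2000, seed=0):
    """Two-sided p-value: how often shuffled data is at least as correlated."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)

# Made-up data: sentiment of the first comment and length of the discussion.
sentiment = [-0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.7, 0.9]
comments = [9, 7, 6, 4, 5, 3, 2, 1]

rho = spearman(sentiment, comments)
p = permutation_p_value(sentiment, comments)
print(f"rho={rho:.3f}, p={p:.4f}")
```

A rank correlation makes sense here because comment counts are usually heavily skewed, and the permutation test avoids assuming any particular distribution.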