Consistency in code reviews (article review)

Image by press 👍 and ⭐ from Pixabay

tse2020_hirao.pdf (uwaterloo.ca)

In the last year, I’ve written a lot about code reviews, mostly because this is where I put my effort now and where I see that software engineers could improve.

Although there are a lot of studies about how good code reviews are and what kind of benefits they bring, there is no doubt that code reviews are a tiresome task. You read software code and try to improve it, but, let’s be honest – if it works, don’t break it, right?

In this paper, the authors study open source communities and check how often the reviewers actually agree on the code review score. They find that it is not that often – in up to 37% of the cases the reviewers disagree. From the paper: “How often do patches receive divergent scores? Results: Divergent review scores are not rare. Indeed, 15%–37% of the studied patch revisions that receive review scores of opposing polarity […]”

They also study how the divergence actually influences the patches – are they integrated or not: “Patches are integrated more often than they are abandoned. For example, patches that elicit positive and negative scores of equal strength are eventually integrated on average 71% of the time. The order in which review scores appear correlates with the integration rate, which tends to increase if negative scores precede positive ones.”

Finally, they study when the discussions/disagreements happen and how many reviewers there actually are: “Patches that are eventually integrated involve one or two more reviewers than patches without divergent scores on average. Moreover, positive scores appear before negative scores in 70% of patches with divergent scores. Reviewers may feel pressured to critique such patches before integration (e.g., due to lazy consensus). Finally, divergence tends to arise early, with 75% of them occurring by the third (QT) or fourth (OPENSTACK) revision.”

I think that these results say something about our community – that we tend to disagree, but do integrate the code anyways. What does that mean?

It could mean two things, which IMHO are equally valid:

  1. The review comments do not really touch upon crucial aspects and therefore are deemed not so important (e.g. whether we call something weatherType or typeOfWeather as a variable…)
  2. The reviewers’ reputation makes it difficult to get some of the comments through, e.g. when a junior reviewer is calling for a complete overhaul of the architecture.

Either way – I think that the modern code review field is quite active these days and I hope that we can get something done about the speed and quality of these long and tiresome code review processes.

Testing machine learning systems…

Image by Comfreak from Pixabay

https://rdcu.be/caKuc

Today, everybody is talking about machine learning and AI. Some talk about deterministic models, some about statistical ones, some about Bayesian ones, and some talk about X-mas 🙂

My experience of working with machine learning is that we need to be very careful about what we actually do. If we do machine learning in the classical sense, e.g. neural network models or decision trees, then we need to make sure that we test the system alongside the data – never together with the data. We need to prepare a dataset that we use as a reference and that we know well.

Testing, in that scenario, becomes just like the testing we already know. We can do the calculations manually, or step by step, and check whether the algorithm behaves as expected.
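To make this concrete, here is a minimal sketch of what such a test could look like. The reference dataset, the decision tree and the use of scikit-learn are my own illustrative choices, not something the article prescribes:

```python
# Minimal sketch: test an ML component against a small, hand-checked reference
# dataset (scikit-learn and the toy data are illustrative choices of mine).
from sklearn.tree import DecisionTreeClassifier

# Reference data we know by heart: points below the diagonal are class 0,
# points above it are class 1. Small enough to verify by hand.
REFERENCE_X = [[1, 0], [2, 1], [3, 1], [0, 1], [1, 2], [1, 3]]
REFERENCE_Y = [0, 0, 0, 1, 1, 1]


def test_model_reproduces_reference_labels():
    model = DecisionTreeClassifier(random_state=0)
    model.fit(REFERENCE_X, REFERENCE_Y)
    # The model must behave correctly on the data we fully understand.
    assert list(model.predict(REFERENCE_X)) == REFERENCE_Y


def test_model_agrees_with_manual_step_by_step_checks():
    model = DecisionTreeClassifier(random_state=0)
    model.fit(REFERENCE_X, REFERENCE_Y)
    # Points whose expected class we worked out manually beforehand.
    assert model.predict([[4, 0]])[0] == 0  # clearly below the diagonal
    assert model.predict([[0, 4]])[0] == 1  # clearly above the diagonal
```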

Testing the system is also not difficult if we follow principles of good engineering – separation of concerns, modularization, observability.

At runtime, we need to make sure that we add mechanisms for aspects such as out-of-distribution inputs and safety cages around the ML algorithms.
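As a minimal sketch of what such a runtime mechanism could look like (the wrapper, its names and the fallback strategy are my own illustration, not taken from the article):

```python
# Minimal sketch of a runtime "safety cage": reject inputs that fall outside
# the value ranges observed during training and fall back to a safe default.
# All names here are illustrative; real safety mechanisms are more elaborate.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SafetyCage:
    predict: Callable[[Sequence[float]], int]  # the wrapped ML model
    feature_mins: List[float]                  # per-feature minimum seen in training
    feature_maxs: List[float]                  # per-feature maximum seen in training
    safe_default: int = 0                      # returned when we do not trust the input

    def __call__(self, x: Sequence[float]) -> int:
        in_bounds = all(
            lo <= value <= hi
            for value, lo, hi in zip(x, self.feature_mins, self.feature_maxs)
        )
        if not in_bounds:
            # Out-of-distribution input: do not trust the model, use the fallback.
            return self.safe_default
        return self.predict(x)


# Usage: wrap any trained model together with its training-data ranges.
# caged = SafetyCage(predict=my_model_predict, feature_mins=[0.0, 0.0],
#                    feature_maxs=[10.0, 5.0], safe_default=0)
# caged([3.0, 2.0])   -> model decision
# caged([99.0, 2.0])  -> safe_default
```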

Either way, I recommend this article to all ML designers and product managers who want to know the state of the art in this field from the perspective of testing. A good overview, nice reading!

Who needs automated code reviews, and when…

https://rdcu.be/caKsW

Image by Arek Socha from Pixabay

Having worked with code reviews for a while, I strongly sympathize with the thesis put forward by the authors of this paper – code review tools are still far from truly supporting software developers.

Yes, they do automate and organize the process. Yes, they help assure that all code is reviewed, and yes, they do help capture problems in the code and spread knowledge.

However, what I expect from such a tool is to help me find problems in the code. I would like to have a tool that helps me, as a designer, get better: avoid mistakes, use cool programming constructs, make better designs. None of the tools I know help with that.

This paper shows that my expectations are similar to those of the developers studied in it. Automatically fixing and suggesting documentation was the top priority; renaming suggestions and commenting/explaining the code were some of the others.

Detection of duplicated code, architectural analysis and similar things were also mentioned as expectations. I cannot agree more! These things are priority 1 – I would also expect them to be there.

Now, some are more difficult than others – like analyzing the architecture. Not a trivial task at all, because what is the architecture? Where are the patterns? How do we find them in the code? How far can we rely on the tools that research provides? We’re not there yet.

Duplicated code, however, is something we should be able to fix. I’ve looked at a repository that had over 200 papers about code clones, duplicates and what have you. Are all these papers good? Probably not, but even if 10% of them are, that still gives us 20 tools to try.
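To illustrate why I think this is tractable, here is a toy sketch of the basic idea behind many clone detectors – normalize the code and hash fixed-size windows of lines. This is my own simplification, not one of the tools from that repository:

```python
# Toy illustration of the basic idea behind many clone detectors: strip the
# formatting, hash fixed-size windows of lines, and report windows that occur
# more than once. This only finds near-identical clones; real tools do more.
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def find_duplicates(files: Dict[str, str], window: int = 5) -> List[List[Tuple[str, int]]]:
    buckets = defaultdict(list)
    for path, text in files.items():
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        for i in range(len(lines) - window + 1):
            chunk = "\n".join(lines[i:i + window])
            digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
            buckets[digest].append((path, i + 1))  # (file, window start index)
    return [places for places in buckets.values() if len(places) > 1]


# Example: report every 5-line window that appears in more than one place.
# clones = find_duplicates({"a.py": open("a.py").read(), "b.py": open("b.py").read()})
```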

I agree, we do have SonarQube and similar tools, but they are not integrated with code review. I cannot just link to a SonarQube report when writing a review comment. I cannot add a review comment to a piece of technical debt detected in SonarQube. So, no integration then?

Maybe it’s just a Friday afternoon thing, but I hope that we can get better at making the last mile with our tools. I hope that we can address the expectations that the developers have…

Data analytics in SE

https://www.sciencedirect.com/science/article/abs/pii/S0950584920301981

Image by Werner Weisser from Pixabay

A few years ago, data analytics and big data were super popular in software engineering. In fact, they were a bit too popular, as many authors claimed big data simply because they had a diagram in the paper.

Fast forward to today and the situation is a bit different. We are more mature in using data in software development. We know that big data is about the 5 Vs and that we can reason about it. We also know that providing diagrams is not the same as using them to direct software development.

I found this paper when looking for literature for our new work on communication in software metrics teams. My colleagues study this communication and have found that there can be several sources of confusion. Now, this paper is NOT about that confusion, but about the prevalence of data analytics in software engineering. The working definition of Big Data Analytics in the paper is as follows: “Big data analytics is the process of using analysis algorithms running on powerful supporting platforms to uncover potentials concealed in big data, such as hidden patterns or unknown correlations”.

The paper poses three main research questions about the studies conducted in Big Data Analytics, about the approaches used and when they are used. I’m mostly interested in the second – which approaches are used. There, the authors pose three sub-questions:

RQ2.1: What types of analytics have been used in the ASD domain?
RQ2.2: What sources of data have been used?
RQ2.3: What methods, models, or techniques have been utilized in the studies?

In particular, the second one is the most interesting – the sources of data. There, the authors found that there are plenty. The entire table (Table 7 in the paper) is too large to quote, but let me quote one of the categories, Source code and data model:

  • Source code
  • Ruby programs & Ruby on Rails
  • Java programs
  • Function calls
  • Code metrics
  • Development repository
  • Test case
  • Code quality
  • Application data schema

I recommend this as good reading on the current state of the art in data analytics in software engineering. I think we’ve matured a lot as a community in the last decade, and that brings a lot of benefit. Our software development gets better and thus our software gets better.

From the abstract: In total, 88 primary studies were selected and analyzed. Our results show that BDA is employed throughout the whole ASD lifecycle. The results reveal that data-driven software development is focused on the following areas: code repository analytics, defects/bug fixing, testing, project management analytics, and application usage analytics.

Is confusion a factor when reviewing code?

https://www.win.tue.nl/~aserebre/EMSE2020Felipe.pdf

Image by Myriams-Fotos from Pixabay

Reviewing code is an art. After working with the topic for a few years, we’ve realized that it is like reading a chat – one person responds to a message sent by another person, the message often being the code and the response being the review comment. What we’ve discovered is that the context of the review is important, as well as the possibility to ask questions. We even discuss having a taxonomy of these review comments to ease understanding of “where” in the review process one is at the moment.

This article caught my attention because it is about understanding when a reviewer is actually confused when reading the code and making a comment. It’s a very nice piece of work, as it combines the analysis of code review comments with surveys.

The results of the survey are interesting, as they point out that the authors are confused much less than the reviewers – which is often caused by the fact that the comment is a response, while the code is the message. Quoting the paper: “RQ1 Summary – Reasons for confusion: We found a total of 30 reasons for confusion. The most prevalent are missing rationale, discussion of the solution: non-functional, and lack of familiarity with existing code. We observe that tools (code review, issue tracker, and version control) and communication issues, such as disagreement or ambiguity in communicative intentions, may also cause confusion during code reviews.”

Finally, I like the fact that the authors do a full systematic mapping study on the topic and triangulate the results. This work will become number one reading for my students in the programming course – it will teach them how important good code is!

From the abstract:

Results: From the first study, we build a framework with 30 reasons for confusion, 14 impacts, and 13 coping strategies. The results of the systematic mapping study shows 38 articles addressing the most frequent reasons for confusion. From those articles, we found 19 different solutions for confusion proposed in the literature, and nine impacts were established related to the most frequent reasons for confusion.

Technical debt from the perspective of practitioners – article review

Image by Steve Buissinne from Pixabay

https://link.springer.com/article/10.1007/s10664-020-09832-9?utm_source=toc&utm_medium=email&utm_campaign=toc_10664_25_5&utm_content=etoc_springer_20200904

Technical debt is a great metaphor in software engineering. It provides software engineers with a toolkit to communicate how bad design can affect the product in the long run, and how much it can cost to fix these problems. The metaphor has been implemented in many static analysis tools, such as SonarQube.

Despite its power in communication, it’s not clear whether this metaphor is actually useful. It has some dark sides, which make it a bit tricky to use. For example, the “conversion” from a problem to debt, e.g. from missing getter and setter methods to 0.5 days of debt, is one of these challenges. It’s also not always clear which of the technical debt categories apply to which products.

What I like about this paper is that it presents a survey of technical debt. For example, it identifies the top causes of technical debt, such as:

  • deadlines,
  • inappropriate planning,
  • lack of knowledge, and
  • lack of well-defined process.

These challenges are present in most companies today, and the first two – deadlines and inappropriate planning – are often associated with start-ups and agile organizations. I recommend to take a closer look at the mindmap in the paper (Fig. 5) to dive deeper into the causes.

Quote from the abstract: We identified a total of 78 causes and 66 effects, which confirm and also extend the current knowledge on causes and effects of TD. Then, we organized the identified set of causes and effects in probabilistic cause-effect diagrams. The proposed diagrams highlight the causes that can most contribute to the occurrence of TD as well as the most common effects that occur as a result of debt.

Finding lines of code that require review – my 100th blog post!

Image by skeeze from Pixabay

Working with continuous integration is an exciting new field. You get your code into the main branch directly. Well, that’s what the theory says. What you really get directly is feedback, at least the feedback from the automated checks for technical debt, testing and similar.

What you do not get quickly is the review of your code by your colleagues. In larger organizations, things like code reviews do not get prioritized, and therefore they tend to slow down software development rather than speed it up!

In this paper, we set out to understand how to fix that. We used Gerrit to extract code reviews and trained machine learning classifiers to point out the lines of code that should be reviewed, instead of reviewing all of the lines. Here is a short video about this: https://play.gu.se/media/t/0_h7hx95d2

The abstract of the paper is included:

Code reviews are one of the first quality assurance tasks in continuous software integration and delivery. The goal of our work is to reduce the need for manual reviews by automatically identify which code fragments should be further reviewed manually. We conducted an action research study with two companies where we extracted code reviews and build machine learning classifiers (AdaBoost and Convolutional Neural Network — CNN). Our results show that the accuracy of recognizing code fragments that require manual review, measured with Matthews Correlation Coefficient, was 0.70 in the combination of our own feature extraction and CNN. We conclude that this way of combining automation with manual code reviews can improve the speed of reviews while providing organizations with the possibility to support knowledge transfer among the designers.
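For readers who want a feeling for what such a classifier could look like, here is a rough sketch along the lines of the abstract (AdaBoost and MCC), with toy data and features of my own – it is not the feature extraction or pipeline we used in the paper:

```python
# Rough sketch along the lines of the abstract: represent lines of code as
# token counts, train AdaBoost to flag lines that received review comments,
# and evaluate with the Matthews Correlation Coefficient. The features and
# the tiny dataset below are illustrative, not the ones used in the paper.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Illustrative data: (line of code, 1 if it received a review comment else 0).
lines = [
    "int x = compute(a, b);", "for (int i = 0; i < n; i++) {", "return null;",
    "if (ptr == NULL) return;", "x = x + 1;", "catch (Exception e) {}",
]
labels = [0, 0, 1, 0, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    lines, labels, test_size=0.33, random_state=0, stratify=labels)

classifier = make_pipeline(
    CountVectorizer(token_pattern=r"\S+"),  # crude tokenization of code lines
    AdaBoostClassifier(random_state=0),
)
classifier.fit(X_train, y_train)
print("MCC:", matthews_corrcoef(y_test, classifier.predict(X_test)))
```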

Recommending refactoring via commit message analysis

Image by annca from Pixabay

https://doi.org/10.1016/j.infsof.2020.106332

In the process of reviewing code, we can identify refactoring opportunities pretty easily. We read the code, try to understand it and provide comments. In the understanding phase, we also get ideas about possible alternatives – why is this done this way?

Now, when writing the comments, we rarely have the time to refactor the code. In CI, the review happens when we commit the code to the main branch, and therefore we expect that code to be delivered and used soon. So, it’s too late to refactor – we need to do it in the next iteration.

But the next iteration is the same: we need to deliver new functions, not “gold-plate” the existing code, deliver it to the main branch, etc. When is the time for refactoring then? How do we document the possibilities and use them when we have a bit of time?

In this work, the authors look at commit messages and identify refactoring possibilities from them, complementing the static and dynamic analysis of code. The method presented in the paper is based on the analysis of code from open source projects, the refactorings applied to that code, and the analysis of the QMOOD quality attributes related to these commits.

The following quote from the paper explains the gist of how the extraction of the refactoring rationale works:

Identifying refactoring rationale has two parts. The first part is the detection of the files that are refactored by developers in a commit. The second part is the identification of changes in the QMOOD quality attributes then comparing these changes with the information in the commit message.

For the first part, we used the GitHub API to identify the changed files in each commit. In the second part, we compared the QMOOD quality attribute values before and after the commit to capture the actual quality changes for each file. Once the changed files and quality attributes were identified, we checked if the developers intended to actually improve these files and quality attributes. In fact, we preprocessed the commit messages and we used the names of code elements in the changed files and the changed quality metrics as keywords to match with words in the commit message. Once the refactoring rationale is automatically detected using this procedure, we continue with the next step to find better refactoring recommendations that can fully meet the developer’s intentions and expectations. In case that no quality changes were identified at all then a warning will be generated to developers that the manually applied refactorings are not addressing the quality issues described in his commit message.
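To give a feeling for the matching step described in the quote, here is a toy sketch with made-up names and a made-up warning text; the real approach computes QMOOD attributes on the actual code and matches more than just attribute names:

```python
# Toy sketch of the matching step described in the quote above: compare the
# quality attributes that actually changed in a commit with the intent stated
# in the commit message. Names and the warning text are my own illustration,
# not the authors' implementation (which computes QMOOD metrics on real code).
from typing import Dict, Optional


def refactoring_rationale_warning(commit_message: str,
                                  qmood_before: Dict[str, float],
                                  qmood_after: Dict[str, float]) -> Optional[str]:
    changed = {name for name, value in qmood_before.items()
               if qmood_after.get(name) != value}
    message = commit_message.lower()
    mentioned = {name for name in qmood_before if name.lower() in message}
    if mentioned and not (mentioned & changed):
        return ("Warning: the commit message mentions "
                + ", ".join(sorted(mentioned))
                + ", but none of these quality attributes changed in this commit.")
    return None  # intent and measured effect agree (or no intent was stated)


# Example with QMOOD-style attribute names:
before = {"reusability": 0.41, "flexibility": 0.37, "understandability": 0.52}
after = {"reusability": 0.41, "flexibility": 0.37, "understandability": 0.52}
print(refactoring_rationale_warning(
    "Refactor parser to improve reusability", before, after))
```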

If a tool can automatically refactor our code – is it good or bad for us, programmers?

https://link-springer-com.ezproxy.ub.gu.se/article/10.1007/s10664-020-09826-7

Image by GimpWorkshop from Pixabay

Recently, I’ve read an article in Empirical Software Engineering about automated code refactoring. I must admit that I do refactoring quite seldom. It’s a tedious task and, for the software that I write, quite unnecessary. My software is often a set of scripts to solve a specific task, and then the key is to document it, not to refactor it. Good documentation helps me understand what I did in that code and how it works. Yes, I know it sounds like a cliché, but that’s how it is for me. I switch tasks so often that I forget what the code was doing.

Nevertheless, I recognize code that is nicely written, formatted and refactored. Therefore, I was on the lookout for a tool that could do something like that for me – suggest a refactoring that I could implement.

So, this is a paper that I found and a tool that I would like to try out. The tool was evaluated through surveys and comprehension tasks with designers and developers. Even when they could recognize that the code had been refactored by a tool, they seemed to be happy with it. So, I’m off to try out the tool :)

Abstract: Refactoring is a maintenance activity that aims to improve design quality while preserving the behavior of a system. Several (semi)automated approaches have been proposed to support developers in this maintenance activity, based on the correction of anti-patterns, which are “poor” solutions to recurring design problems. However, little quantitative evidence exists about the impact of automatically refactored code on program comprehension, and in which context automated refactoring can be as effective as manual refactoring. Leveraging RePOR, an automated refactoring approach based on partial order reduction techniques, we performed an empirical study to investigate whether automated refactoring code structure affects the understandability of systems during comprehension tasks. (1) We surveyed 80 developers, asking them to identify from a set of 20 refactoring changes if they were generated by developers or by a tool, and to rate the refactoring changes according to their design quality; (2) we asked 30 developers to complete code comprehension tasks on 10 systems that were refactored by either a freelancer or an automated refactoring tool. To make comparison fair, for a subset of refactoring actions that introduce new code entities, only synthetic identifiers were presented to practitioners. We measured developers’ performance using the NASA task load index for their effort, the time that they spent performing the tasks, and their percentages of correct answers. Our findings, despite current technology limitations, show that it is reasonable to expect a refactoring tools to match developer code. Indeed, results show that for 3 out of the 5 anti-pattern types studied, developers could not recognize the origin of the refactoring (i.e., whether it was performed by a human or an automatic tool). We also observed that developers do not prefer human refactorings over automated refactorings, except when refactoring Blob classes; and that there is no statistically significant difference between the impact on code understandability of human refactorings and automated refactorings. We conclude that automated refactorings can be as effective as manual refactorings. However, for complex anti-patterns types like the Blob, the perceived quality achieved by developers is slightly higher.

PHANTOM – finding well engineered software projects, fast…

https://link-springer-com.ezproxy.ub.gu.se/article/10.1007%2Fs10664-020-09825-8

Image by 2427999 from Pixabay

I’ve worked with two great students – Peter and Joshua – who wanted to do something interesting. They developed a tool that could replicate a study from other researchers, but faster and with less data. We also managed to team up with Mirek from Poznan, who improved the classification algorithm and asked his colleagues for new, industrial data.

And this is the outcome – a tool that can connect to a git repository and recognize whether your project is well engineered or not. It helps companies understand whether their teams are working in a structured manner or ad hoc.

The tool provides the possibility to assess whether a specific repository is in need of maintenance or not.

Abstract:

Context: Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets.

Objective: The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way.

Method: This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm.

Results: Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify “engineered” projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies.

Conclusions: It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.
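To give a flavour of the pipeline described in the Method paragraph above (measures extracted from Git logs, turned into time series, represented as feature vectors and clustered with k-means), here is a rough sketch with my own simplifications – a single measure (weekly commit counts) and crude summary features, instead of PHANTOM’s five measures:

```python
# Rough sketch of the pipeline described in the Method paragraph: extract a
# measure from the Git log (here only weekly commit counts), treat it as a
# time series, summarize it as a feature vector, and cluster repositories
# with k-means. PHANTOM uses five measures and a richer representation; the
# simplifications below are my own.
import subprocess
from collections import Counter
from datetime import datetime, timezone
from typing import List

import numpy as np
from sklearn.cluster import KMeans


def weekly_commit_series(repo_path: str) -> np.ndarray:
    timestamps = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%ct"],
        capture_output=True, text=True, check=True).stdout.split()
    weeks = Counter(
        datetime.fromtimestamp(int(ts), tz=timezone.utc).isocalendar()[:2]
        for ts in timestamps)
    return np.array([count for _, count in sorted(weeks.items())], dtype=float)


def to_features(series: np.ndarray) -> np.ndarray:
    # Crude summary of the time series; PHANTOM's feature vectors are richer.
    return np.array([series.mean(), series.std(), series.max(), float(len(series))])


def cluster_repositories(repo_paths: List[str], k: int = 2) -> np.ndarray:
    X = np.vstack([to_features(weekly_commit_series(path)) for path in repo_paths])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)


# labels = cluster_repositories(["/path/to/repo-a", "/path/to/repo-b"], k=2)
```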