What are code reviews really good for?

Visualization of a source code of one module from the Cloudera projects. The embeddings are taken from our team’s neural network. t-SNE is a visualization technique taken from bioinformatics.

Concerns identified in code review: A fine-grained, faceted classification – ScienceDirect

Code reviews are time consuming. And effort intensive. And boring. And needed. Depending whom we ask, we get one of the above answers (well, 80% of the time). The reality is that the code reviews are not the most productive activity. Reading the code and looking for defects is good when we do it once, but when we need to work with it during continuous integration, the story changes. It becomes like studying for the exam or the homework – we do everything else to postpone it. Then someone waits longer or the code quality suffers.

There has been a lot of work done to make this activity more fun – gamification, automated support, using machine learning to filter out the code that we can automatically check – just to name the few. As far as I know, there has not been much work in understanding of what kind of problems code reviews really find.

In this article, the authors address that very question. Admittedly, they only analyzed 7 OSS projects, but their results are still interesting: “We identified 116 defect types that we grouped into 15 groups to create a defect classification. Additionally, 38% of these defects could be automatically detected accurately.

So, that basically means that 38% of defects could be identified by using testing or static analysis (or some other fancy automation technique). This figure summarizes their results (this is a link to the figure in sciencedirect): https://ars.els-cdn.com/content/image/1-s2.0-S0950584922001653-gr5_lrg.jpg

So, what the code reviews are good for? Here is their list:

  • threads,
  • header comments,
  • errors, warnings and logging,
  • test cases,
  • annotations,
  • performance,
  • identifier naming,
  • modifiers,
  • comments,
  • javadoc,
  • design,
  • implementation, and
  • logic and functionality

The list is sorted from the least frequent to the most frequent – so logic and functionality is the category where the code reviews are the most useful for. I need to also say that the frequencies are not super-high – threading is only 1 detected concern, while logic and functionality has 57. So, you know, could be more, given how much time is spent on code reviews. I guess it is what the quality costs nowadays, even though there is no real data on this.

Machine learning in compilers???

BenchPress: A Deep Active Benchmark Generator (arxiv.org)

To be honest, I did not expect machine learning to be part of a compiler… I’ve done programming since I was 13, understood compilers during my second year at the university and even wrote one (well, without any ML, that is).

Why would a compiler need machine learning, I wondered. It’s a pretty simple program – it takes a grammar, then parses the source code and translates that to a machine code (or some other low level representation). It has to be deterministic as the same program cannot compile to two different machine codes. It’s just the way it is….

It turns out that machine learning is used in modern compilers to perform optimizations. The optimizations are done to take advantage of modern processors, their registers and long instructions sets. These optimizations are meant to support machine code in being more parallel, allowing the modern multi-core, multi-thread processors to utilize every little bit of energy in all their cores.

In this paper, the authors use language models like BERT to create a benchmark that will allow different optimization techniques to be compared. This means, that the same compiler, can test itself against these benchmarks in order to find the best possible solution. Clever!

However, this is it from me. I’m planning on writing a compiler, let alone an optimizer. I may use BERT models in the future for generation of programs, but I will most probably end there. But, in case you wonder – there is ML in compilers šŸ™‚

Language models in Software Engineering (new paper review)

Image by Lorenzo Cafaro from Pixabay

Articla available at: https://arxiv.org/pdf/2205.11739.pdf

It’s no secret that I’ve been fascinated by modern, BERT-like language models. I’ve seen what they can do and how they operate, use them in two of my research projects. So, when this paper came around, I read it directly.

It’s a paper which makes an overview of what kind of tasks the language models are used in software engineering today. The list is long and contains a variety of tasks, e.g., code-to-code retrieval, repairing of source code or bug finding/fixing. In total a lot of these tasks, but, IMHO, a bit low-level tasks. There are no tasks that attempt to understand code at the design-level, for example whether we can really see specific design in the code.

The paper also shows which models are used, and provides references to these models. They list 20 models, with the tasks for which they were trained, including the datasets that they were trained on. Fantastic!

I need to dive deeper into these models, but I’m super happy about the fact that there is a list of these models now and that the language technology makes a significant body of work in software engineering now.

Testing of ML systems

BIld av OpenClipart-Vectors frƄn Pixabay

Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

Machine learning systems are very popular today, at least when it comes to research applications. They are not as popular as one would wished (or liked) in the real applications. One of the reasons is the fact that they are hard to test. We do not know how to check if an algorithm will behave as expected in all similar situations – well, we do not know which situations are similar for us and for the ML system.

This paper looks at the problem from a different angle. The research question is: RQ: What are simple and generic software tests that are capable of finding bugs and improving the quality of machine learning algorithms?

The authors developed a set of smoke tests, which they see that all ML algorithms should pass. The paper is quite exhaustive and if you are interested, I recommend to take a look at this table:

Table 1 | Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

I love the article. It is simple, to the point and very applied. I’m going to use that in my tests of ML algorithms in the future.

How good are language models for source code tasks?

https://ieeexplore-ieee-org.ezproxy.ub.gu.se/document/9653849

Using machine learning, and deep learning in particular, for software engineering tasks exploded recently. I would say that it exploded a bit too much. I’m myself to blame here as our team was one of the early adopters with the CCFlex model and source code analysis.

Well, this paper compares a number of modern deep learning models, so called transformers, in various code and comment analysis tasks. The authors did a great job in collecting a set of models and datasets, trained them and critically evaluated the performance.

I recommend reading the entire paper, but what they found was a bit surprised for me. First of all, they found that the transformer models are better for the natural language and not so great for the source code analysis. The hypothesis is that the structure of programs is important here. They have also found that pre-training is important, but not crucial. Pre-training attributes to a moderate effect in the end. The dataset, and its content, is much more important for the task at hand.

This is a great paper and I hope that this can become an essential reading for software engineers working with AI systems engineering supporting the software engineering tasks.

Reviewing rounds prediction for code patches

Image by pixabay

Reviewing rounds prediction for code patches | SpringerLink

Understanding how reviews of source code are done seem to be one of my main interests recently. Partly because reviews are important for software quality, while taking time. Partly, also, because I think it is interesting to check if we can quantify a good opinion from a good software developer.

In this paper, the authors study how to predict to which degree one can predict how many comments a given patch will have. Now, this problem may not be the most exciting ones, but it attracted my attention because of the fact that the authors studied the same projects as we did. However, to the contrary of our work, they also take into consideration features that characterize software developer networks – for example the experience in commenting software patches or the networking.

Now, to the results. The models presented in this paper seem to be quite good in quantifying and predicting patches within the same project – all kinds of predictions have pretty good F1-scores, above 65%. This means that we can train these models for our own projects and to be able to predict whether a particular patch will be commented on once or twice, or even many times.

The performance of the model on the cross-projects dataset. There, the performance is ok for predicting whether a particular patch will be commented on. Predicting how many times the patch will be commented on, or even that it will be commented on many times, does not work very well. The magnitude of the performance measures oscillates close to the 0% mark, which means that the models are not better than just guessing. I guess, you cannot have it all… from one model.

To sum up, the reason I read this article in more detail than others, was essentially not the performance, but tryin to understand the underlying techniques which they use. I’d like to say that they use a good set of features, which I recommend other to use (and will definitely use myself in the next studies), and the fact that they use simple language models, like word2vec, to understand the programming language. What I lack, though, is scrutinizing whether there is a statistically significant dependency between the sentiment (or even its strength) and the length of the discussion.

Rationality by Steven Pinker

Rationality: What It Is, Why It Seems Scarce, Why It Matters : Pinker, Steven: Amazon.se: Bƶcker

I’ve read this book because I wanted to get some inspiration from social sciences. What I ended up with is a book about Bayesian statistics and its consequences. To some extent, I’m happy to have read it, because I got a better view of how to use Bayesian statistics in practice. Yet, I am a bit disappointed. Not about the statistics, but about the book.

A while back I read the book by Judea Pearl about Bayesian networks and this role in statistics. That book was about something new and a bit fresh. It got my interest and I felt refreshed after I read it. This book was a bit too much of a repetition. Don’t get me wrong, I like repetition and I like the style of Steven Pinker (he is one of my role models when it comes to academic writing).

The book is about making inferences by calculating probabilities using Bayes theorem. It explain them very well and shows how they are used and misused in modern society – from politics to medical diagnoses. The book shows a few tricks to make better decisions and inferences.

I guess I need to read it once more in order to get a better taste of its distinct flavor.

Post-Corona branding…

Post Corona: From Crisis to Opportunity: Galloway, Scott: 9780593332214: Amazon.com: Books

A good holiday reading is something that is an essential addition to the time spent with family and friends. Every year I try to get hold of a good book to get inspiration for the upcoming year. Last year, I’ve read “Grit”, which is about perseverance. This year, I noticed a book of NYU Stern professor Scott Galloway – “Post Corona: From Crisis to Opportunity”. Galloway is also the author of “The Four”, which is a book about Apple, Google, Facebook, and Amazon.

Now, to the topic of the day – the post-corona book. I’ve read this book a bit slower than I usually do (which is a good thing). When I read it I had Galloway’s voice in my head talking about the opportunities of large companies – a big tech, as he calls them. His thesis is that the pandemic actually accelerated their growth to the size which makes them really hard to disrupt. By mergers and acquisitions, they can “cannibalize” their competition, unless the competition is them.

I thought that this book would be about Zoom – a company unheard of before the pandemic, now a synonym for a phone call. I thought that the book would be about health services and telemedicine – another area that was small and now is big. Now, it was nothing like that. The book was about The Four and how they capitalize on their brands in times of pandemic.

There is a thesis out there, that if you are getting something for free, it’s not worth much. In this book, Galloway popularizes another thesis – if you are getting something for free, you are the product, not the consumer. He uses this as a way of explaining why Apple charges so much for their products – for not using our data, whereas Google and Facebook/Meta capitalize on our data. Apple connects 200 data points per day from us, while Google collects 2000 data points per hour – a small difference.

I’m not a privacy freak, but I do not want to be a product unless I choose to. I do not want companies to monetize on me, my behavior, and my family. But, and that’s a sad thing, I do want to have great services for a reasonable price. I want my maps to work well – the one in the car’s GPS simply does not make it. I want to watch short tutorials on YouTube – Netflix does not produce tutorials about variational autoencoders (yet).

To sum up, I like the thesis posed by Galloway, that the next big thing taken up by Amazon, will probably be medical insurance or schools. It is not difficult to see that the telemedicine model is essentially mature enough for being disrupted. I really recommend this book as food for thought in the post-pandemic (or endemic) world of 2022.

Test prioritization – a systematic review (review)

Image source: pixabay

Test case selection and prioritization using machine learning: a systematic literature review (springer.com)

Testing is an important activity in every software engineering project. In professional organizations, the process is structured and well-organized. In smaller projects, start-up style organizations, or in research studies, the process is less organized.

There are different views on why we do testing. Some think that we do testing to find defects, some to prove that the software works correctly, finally some think that we do this to waste time (well, not so many maybe). In my experience it is the combination of the first and the second. We do testing to find defects and also to track how good our software gets over time (software reliability growth modelling).

This paper presents a systematic literature review on using machine learning to select and prioritize test cases. I think that the authors summarize their contribution in a very good way (quote):

  • The main ML techniques used for TSP are: supervised learning (ranking models), unsupervised learning (clustering), reinforcement learning, and natural language processing.
  • ML-based TSP techniques mainly rely on features that are easy to compute and based on data that are practical to collect in a CI context, including execution history, coverage information, code complexity, and textual data.
  • ML-based TSP techniques are evaluated using a variety of metrics that are, sometimes, calculated differently in TS and TP, making it difficult to compare their results. Most of the currently available subjects have extremely low failure rates, making them unsuitable for evaluating ML-based TSP techniques.
  • Comparing the performance of ML-based TSP techniques is challenging due to the variation of evaluation metrics, test suite sizes, and failure rates across studies. Reporting failure rates alongside performance values helps provide more interpretable results to the wider research community.
  • Only six out of the 29 selected studies (21%) can be considered reproducible, thus raising methodological issues in the studies and a lack of confidence in reported results.

I think the biggest surprise, for me, is that complexity-based metrics are still used widely in this context. I’m happy that there are new approaches on the rise, for example textual analyses. I guess there is a point in combining approaches, but complexity seems like a very coarse-grained instrument for this type of analysis. We know it correlates well with size, and the larger the test (or UUT), the higher the probability of triggering a failure.

Well, I guess I need to make more experiments myself to check if I miss something.

Merry X-mas and the next year with AI

Image by Peter Pieras from Pixabay

Sparse reward for reinforcement learning‐based continuous integration testing – Yang – – Journal of Software: Evolution and Process – Wiley Online Library

This is the last post that I want to write in 2021. The year has been hectic and full of surprises. First, we got the news that the vaccine works for Covid-19. We all prepared for normalization, for being able to travel, visit friends, families, and conferences in person.

Then came the new variants, like the Omikron, which seem to escape from the vaccine, and countries still are not ready for opening. Conferences get postponed, trips canceled. I hope this is just a temporary situation and that we will be able to get in control of the situation again.

For the last post in 2021, I chose one of the articles that I’ve recently read – about the use of reinforcement learning in integration testing. Kind of a different approach to what we do in the Software Center project.

This paper tackles the problem of sparse rewards for fitness functions when using reinforcement learning for test selection. It proposes a combination of historical data and a function that assigns a higher reward for non-sparse data. It looks like the work is very promising, as it has been tested on 14 different industrial data sets. I need to check if during the coming holidays. It’s a project to do for X-Mas

With that, I would like to thank all of you for being here with me during 2021 and hope that we can continue in 2022. Wish you all great holidays and the best of luck in the coming 2022!

From the abstract:

“Reinforcement learning (RL) has been used to optimize the continuous integration (CI) testing, where the reward plays a key role in directing the adjustment of the test case prioritization (TCP) strategy. In CI testing, the frequency of integration is usually very high, while the failure rate of test cases is low. Consequently, RL will get scarce rewards in CI testing, which may lead to low learning efficiency of RL and even difficulty in convergence. This paper introduces three rewards to tackle the issue of sparse rewards of RL in CI testing. First, the historical failure density-based reward (HFD) is defined, which objectively represents the sparse reward problem. Second, the average failure position-based reward (AFP) is proposed to increase the reward value and reduce the impact of sparse rewards. Furthermore, a technique based on additional reward is proposed, which extracts the test occurrence frequency of passed test cases for additional rewards. Empirical studies are conducted on 14 real industry data sets. The experiment results are promising, especially the reward with additional reward can improve NAPFD (Normalized Average Percentage of Faults Detected) by up to 21.97%, enhance Recall with a maximum of 21.87%, and increase TTF (Test to Fail) by an average of 9.99 positions. “