Machine learning in compilers???

BenchPress: A Deep Active Benchmark Generator (arxiv.org)

To be honest, I did not expect machine learning to be part of a compiler… I’ve been programming since I was 13, got to grips with compilers during my second year at university, and even wrote one (well, without any ML, that is).

Why would a compiler need machine learning, I wondered. It’s a pretty simple program – it takes a grammar, parses the source code, and translates it to machine code (or some other low-level representation). It has to be deterministic, as the same program cannot compile to two different machine codes. It’s just the way it is…

It turns out that machine learning is used in modern compilers to perform optimizations. These optimizations are there to take advantage of modern processors, their registers and long instruction sets. They help the generated machine code run more in parallel, allowing modern multi-core, multi-threaded processors to use every last bit of performance in all their cores.

In this paper, the authors use language models like BERT to create benchmarks that allow different optimization techniques to be compared. This means that the same compiler can test itself against these benchmarks in order to find the best possible solution. Clever!
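
To make this concrete for myself, here is a minimal sketch of the "generate candidates, keep what compiles" loop. This is my own illustration, not the BenchPress pipeline: the gpt2 model, gcc, and the prompt are just stand-ins I made up, while the paper trains a BERT-style model on real benchmark code.

```python
# Toy sketch: generate code candidates with a language model, keep those that
# compile. My own illustration, NOT the BenchPress implementation; gpt2 and
# gcc are stand-ins.
import subprocess
import tempfile

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def compiles(source: str) -> bool:
    """Check whether gcc accepts the snippet as a C translation unit."""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(["gcc", "-fsyntax-only", path], capture_output=True)
    return result.returncode == 0

# Seed the model with a kernel-like prompt (made up) and filter the output.
prompt = "void kernel(float *a, float *b, int n) {"
candidates = generator(prompt, max_new_tokens=64,
                       num_return_sequences=5, do_sample=True)
benchmarks = [c["generated_text"] + "\n}"
              for c in candidates
              if compiles(c["generated_text"] + "\n}")]
print(f"kept {len(benchmarks)} of {len(candidates)} candidates")
```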

However, this is it from me. I’m not planning on writing a compiler, let alone an optimizer. I may use BERT models in the future for generating programs, but I will most probably stop there. But, in case you wonder – there is ML in compilers 🙂

Language models in Software Engineering (new paper review)

Image by Lorenzo Cafaro from Pixabay

Article available at: https://arxiv.org/pdf/2205.11739.pdf

It’s no secret that I’ve been fascinated by modern, BERT-like language models. I’ve seen what they can do and how they operate, and I use them in two of my research projects. So, when this paper came around, I read it right away.

It’s a paper that gives an overview of the kinds of tasks language models are used for in software engineering today. The list is long and contains a variety of tasks, e.g., code-to-code retrieval, source code repair, or bug finding/fixing. A lot of tasks in total but, IMHO, rather low-level ones. There are no tasks that attempt to understand code at the design level, for example whether we can really see a specific design in the code.

The paper also shows which models are used and provides references to them. They list 20 models, together with the tasks for which they were trained and the datasets they were trained on. Fantastic!

I need to dive deeper into these models, but I’m super happy that such a list now exists and that language technology now makes up a significant body of work in software engineering.

Testing of ML systems

Image by OpenClipart-Vectors from Pixabay

Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

Machine learning systems are very popular today, at least when it comes to research applications. They are not as popular as one would wish (or like) in real applications. One of the reasons is that they are hard to test. We do not know how to check whether an algorithm will behave as expected in all similar situations – well, we do not even know which situations are similar for us and for the ML system.

This paper looks at the problem from a different angle. The research question is: What are simple and generic software tests that are capable of finding bugs and improving the quality of machine learning algorithms?

The authors developed a set of smoke tests, which they argue all ML algorithms should pass. The paper is quite exhaustive, and if you are interested, I recommend taking a look at this table:

Table 1 | Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink
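
To give a flavour of what such tests can look like, here is a small sketch in the spirit of the paper – my own pytest-style examples for a scikit-learn classifier, not the authors’ actual test catalogue.

```python
# Hedged sketch of ML smoke tests -- my own examples, not the paper's suite.
import numpy as np
import pytest
from sklearn.linear_model import LogisticRegression

def make_model():
    return LogisticRegression(max_iter=1000)

def test_fits_tiny_dataset():
    # The algorithm should at least learn a trivially separable problem.
    X = np.array([[0.0], [0.1], [0.9], [1.0]])
    y = np.array([0, 0, 1, 1])
    model = make_model().fit(X, y)
    assert model.score(X, y) == 1.0

def test_rejects_nan_input():
    # Silently accepting NaNs is a classic severe bug; we expect an error.
    X = np.array([[np.nan], [1.0]])
    y = np.array([0, 1])
    with pytest.raises(ValueError):
        make_model().fit(X, y)

def test_predict_shape_matches_input():
    # Output shape should follow the input, regardless of the data values.
    X = np.random.rand(20, 3)
    y = np.random.randint(0, 2, size=20)
    model = make_model().fit(X, y)
    assert model.predict(X).shape == (20,)
```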

I love the article. It is simple, to the point and very applied. I’m going to use that in my tests of ML algorithms in the future.

How good are language models for source code tasks?

https://ieeexplore-ieee-org.ezproxy.ub.gu.se/document/9653849

Using machine learning, and deep learning in particular, for software engineering tasks has exploded recently. I would say that it exploded a bit too much. I’m myself to blame here, as our team was one of the early adopters with the CCFlex model and source code analysis.

Well, this paper compares a number of modern deep learning models, so-called transformers, on various code and comment analysis tasks. The authors did a great job of collecting a set of models and datasets, training the models, and critically evaluating their performance.

I recommend reading the entire paper, but what they found was a bit surprising to me. First of all, they found that the transformer models work better for natural language than for source code analysis. The hypothesis is that the structure of programs matters here. They also found that pre-training is important, but not crucial – it contributes only a moderate effect in the end. The dataset, and its content, is much more important for the task at hand.

This is a great paper, and I hope it can become essential reading for software engineers working on AI systems that support software engineering tasks.

Reviewing rounds prediction for code patches

Image by pixabay

Reviewing rounds prediction for code patches | SpringerLink

Understanding how reviews of source code are done seems to be one of my main interests recently. Partly because reviews are important for software quality while also taking time. Partly, also, because I think it is interesting to check whether we can quantify a good opinion from a good software developer.

In this paper, the authors study to which degree one can predict how many comments a given patch will receive. Now, this problem may not be the most exciting one, but it attracted my attention because the authors studied the same projects as we did. However, in contrast to our work, they also take into consideration features that characterize software developer networks – for example, experience in commenting on software patches or the position in the network.

Now, to the results. The models presented in this paper seem to be quite good at quantifying and predicting patches within the same project – all kinds of predictions have pretty good F1-scores, above 65%. This means that we can train these models on our own projects and predict whether a particular patch will be commented on once, twice, or even many times.

The performance of the models on the cross-project dataset is a different story. There, the performance is OK for predicting whether a particular patch will be commented on at all. Predicting how many times the patch will be commented on, or whether it will be commented on many times, does not work very well. The performance measures oscillate close to the 0% mark, which means that the models are no better than guessing. I guess you cannot have it all… from one model.

To sum up, the reason I read this article in more detail than others was essentially not the performance, but trying to understand the underlying techniques they use. I’d like to say that they use a good set of features, which I recommend others to use (and will definitely use myself in the next studies), and that they use simple language models, like word2vec, to understand the programming language. What I lack, though, is a check of whether there is a statistically significant dependency between the sentiment (or even its strength) and the length of the discussion.
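
As a note to self, here is a minimal sketch of that kind of feature set – a word2vec embedding of the patch text combined with hypothetical developer-network features, fed to an off-the-shelf classifier. This is my own illustration with made-up data, not the authors’ pipeline.

```python
# Hedged sketch: combine text embeddings with developer-network features to
# predict review rounds. Data and feature choices are made up for illustration.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier

# Toy corpus of tokenized patch descriptions (assumed data).
patches = [
    ["fix", "null", "pointer", "in", "parser"],
    ["refactor", "logging", "module"],
    ["add", "unit", "tests", "for", "cache"],
]
rounds = [2, 1, 1]  # number of review rounds each patch went through

w2v = Word2Vec(sentences=patches, vector_size=32, min_count=1, epochs=50)

def embed(tokens):
    """Average the word vectors of a patch description."""
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

# Hypothetical developer-network features: (#patches authored, #patches reviewed).
dev_features = np.array([[120, 45], [12, 3], [60, 80]])

X = np.hstack([np.vstack([embed(p) for p in patches]), dev_features])
clf = RandomForestClassifier().fit(X, rounds)
print(clf.predict(X[:1]))  # predicted number of review rounds for the first patch
```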

Rationality by Steven Pinker

Rationality: What It Is, Why It Seems Scarce, Why It Matters : Pinker, Steven: Amazon.se: Böcker

I’ve read this book because I wanted to get some inspiration from social sciences. What I ended up with is a book about Bayesian statistics and its consequences. To some extent, I’m happy to have read it, because I got a better view of how to use Bayesian statistics in practice. Yet, I am a bit disappointed. Not about the statistics, but about the book.

A while back I read the book by Judea Pearl about Bayesian networks and their role in statistics. That book was about something new and a bit fresh. It got my interest and I felt refreshed after I read it. This book was a bit too much of a repetition. Don’t get me wrong, I like repetition and I like the style of Steven Pinker (he is one of my role models when it comes to academic writing).

The book is about making inferences by calculating probabilities using Bayes’ theorem. It explains this very well and shows how such reasoning is used and misused in modern society – from politics to medical diagnoses. The book also shows a few tricks to make better decisions and inferences.
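
A small worked example of the kind of inference the book walks through – the classic medical-test calculation (the numbers are illustrative, not taken from the book):

```python
# Bayes' theorem for P(disease | positive test) -- illustrative numbers only.
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive) = P(positive | disease) * P(disease) / P(positive)."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 90% sensitivity, 9% false-positive rate.
print(round(posterior(0.01, 0.90, 0.09), 3))  # ~0.092 -- far lower than most people guess
```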

I guess I need to read it once more in order to get a better taste of its distinct flavor.

Post-Corona branding…

Post Corona: From Crisis to Opportunity: Galloway, Scott: 9780593332214: Amazon.com: Books

A good holiday reading is an essential addition to the time spent with family and friends. Every year I try to get hold of a good book to get inspiration for the upcoming year. Last year I read “Grit”, which is about perseverance. This year, I noticed a book by NYU Stern professor Scott Galloway – “Post Corona: From Crisis to Opportunity”. Galloway is also the author of “The Four”, which is a book about Apple, Google, Facebook, and Amazon.

Now, to the topic of the day – the post-corona book. I’ve read this book a bit more slowly than I usually do (which is a good thing). When I read it, I had Galloway’s voice in my head talking about the opportunities of large companies – big tech, as he calls them. His thesis is that the pandemic actually accelerated their growth to a size that makes them really hard to disrupt. Through mergers and acquisitions, they can “cannibalize” their competition, unless the competition is them.

I thought that this book would be about Zoom – a company unheard of before the pandemic, now a synonym for a phone call. I thought that the book would be about health services and telemedicine – another area that was small and now is big. But it was nothing like that. The book was about The Four and how they capitalize on their brands in times of pandemic.

There is a thesis out there that if you are getting something for free, it’s not worth much. In this book, Galloway popularizes another thesis – if you are getting something for free, you are the product, not the consumer. He uses this to explain why Apple charges so much for their products – for not using our data, whereas Google and Facebook/Meta capitalize on our data. Apple collects 200 data points per day from us, while Google collects 2,000 data points per hour – a small difference.
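
Just to put those quoted numbers on the same scale (and to appreciate the irony of “a small difference”):

```python
# Apple: 200 data points per day; Google: 2,000 per hour (numbers as quoted above).
apple_per_day = 200
google_per_day = 2_000 * 24
print(google_per_day)                   # 48000 data points per day
print(google_per_day // apple_per_day)  # 240x more than Apple
```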

I’m not a privacy freak, but I do not want to be a product unless I choose to be. I do not want companies to monetize me, my behavior, and my family. But, and that’s the sad thing, I do want to have great services for a reasonable price. I want my maps to work well – the one in the car’s GPS simply does not cut it. I want to watch short tutorials on YouTube – Netflix does not produce tutorials about variational autoencoders (yet).

To sum up, I like the thesis posed by Galloway that the next big thing taken up by Amazon will probably be medical insurance or schools. It is not difficult to see that the telemedicine model is essentially mature enough to be disrupted. I really recommend this book as food for thought in the post-pandemic (or endemic) world of 2022.

Test prioritization – a systematic review (review)

Image source: pixabay

Test case selection and prioritization using machine learning: a systematic literature review (springer.com)

Testing is an important activity in every software engineering project. In professional organizations, the process is structured and well-organized. In smaller projects, start-up style organizations, or in research studies, the process is less organized.

There are different views on why we do testing. Some think that we do testing to find defects, some to show that the software works correctly, and some, finally, think that we do it to waste time (well, not so many, maybe). In my experience, it is a combination of the first and the second. We do testing to find defects and also to track how good our software gets over time (software reliability growth modelling).

This paper presents a systematic literature review on using machine learning to select and prioritize test cases. I think that the authors summarize their contribution in a very good way (quote):

  • The main ML techniques used for TSP are: supervised learning (ranking models), unsupervised learning (clustering), reinforcement learning, and natural language processing.
  • ML-based TSP techniques mainly rely on features that are easy to compute and based on data that are practical to collect in a CI context, including execution history, coverage information, code complexity, and textual data.
  • ML-based TSP techniques are evaluated using a variety of metrics that are, sometimes, calculated differently in TS and TP, making it difficult to compare their results. Most of the currently available subjects have extremely low failure rates, making them unsuitable for evaluating ML-based TSP techniques.
  • Comparing the performance of ML-based TSP techniques is challenging due to the variation of evaluation metrics, test suite sizes, and failure rates across studies. Reporting failure rates alongside performance values helps provide more interpretable results to the wider research community.
  • Only six out of the 29 selected studies (21%) can be considered reproducible, thus raising methodological issues in the studies and a lack of confidence in reported results.

I think the biggest surprise, for me, is that complexity-based metrics are still used widely in this context. I’m happy that there are new approaches on the rise, for example textual analyses. I guess there is a point in combining approaches, but complexity seems like a very coarse-grained instrument for this type of analysis. We know it correlates well with size, and the larger the test (or UUT), the higher the probability of triggering a failure.
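
For my own experiments, here is a minimal sketch of what a history-based, ML-driven prioritization could look like – the features and data are made up for illustration and not taken from any of the surveyed studies.

```python
# Hedged sketch of ML-based test prioritization using execution-history
# features. Feature set and data are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per test case: (recent failure rate, avg duration in s, #changed files covered)
X_history = np.array([
    [0.30, 12.0, 5],
    [0.00,  3.0, 1],
    [0.10, 45.0, 9],
    [0.05,  7.0, 2],
])
failed_last_run = np.array([1, 0, 1, 0])

# Learn a failure-probability model and rank tests by predicted risk.
model = LogisticRegression().fit(X_history, failed_last_run)
risk = model.predict_proba(X_history)[:, 1]
priority_order = np.argsort(-risk)
print("run tests in this order:", priority_order)
```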

Well, I guess I need to run more experiments myself to check whether I’m missing something.

Merry X-mas and the next year with AI

Image by Peter Pieras from Pixabay

Sparse reward for reinforcement learning-based continuous integration testing – Yang – Journal of Software: Evolution and Process – Wiley Online Library

This is the last post that I want to write in 2021. The year has been hectic and full of surprises. First, we got the news that the vaccine works for Covid-19. We all prepared for normalization, for being able to travel, to visit friends and family, and to attend conferences in person.

Then came the new variants, like Omicron, which seem to escape the vaccine, and countries are still not ready to open up. Conferences get postponed, trips canceled. I hope this is just a temporary situation and that we will be able to get the situation under control again.

For the last post in 2021, I chose one of the articles that I’ve recently read – about the use of reinforcement learning in integration testing. Kind of a different approach to what we do in the Software Center project.

This paper tackles the problem of sparse rewards for fitness functions when using reinforcement learning for test selection. It proposes a combination of historical data and a function that assigns a higher reward to non-sparse data. The work looks very promising, as it has been tested on 14 different industrial data sets. I need to check it during the coming holidays. It’s a project to do over X-mas.
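
To remember the core idea, here is a hedged sketch of a history-informed reward function that avoids returning zero on failure-free integration cycles – my own illustration of the sparse-reward problem, not the HFD or AFP rewards defined in the paper.

```python
# Hedged sketch: a reward function for RL-based test prioritization that falls
# back on historical failure rates when no test fails, so the agent still
# receives a signal on the (very common) all-pass CI cycles.
def reward(test_verdicts, historical_failure_rates):
    """test_verdicts: 1 = failed now, 0 = passed; rates: past failure frequency."""
    if any(test_verdicts):
        # Plenty of signal when something fails: reward each detected failure.
        return float(sum(test_verdicts))
    # All tests passed: return a small history-based reward instead of zero.
    return 0.1 * sum(historical_failure_rates) / len(historical_failure_rates)

print(reward([0, 0, 0], [0.2, 0.05, 0.4]))  # small but non-zero
print(reward([1, 0, 1], [0.2, 0.05, 0.4]))  # 2.0
```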

With that, I would like to thank all of you for being here with me during 2021 and hope that we can continue in 2022. Wish you all great holidays and the best of luck in the coming 2022!

From the abstract:

“Reinforcement learning (RL) has been used to optimize the continuous integration (CI) testing, where the reward plays a key role in directing the adjustment of the test case prioritization (TCP) strategy. In CI testing, the frequency of integration is usually very high, while the failure rate of test cases is low. Consequently, RL will get scarce rewards in CI testing, which may lead to low learning efficiency of RL and even difficulty in convergence. This paper introduces three rewards to tackle the issue of sparse rewards of RL in CI testing. First, the historical failure density-based reward (HFD) is defined, which objectively represents the sparse reward problem. Second, the average failure position-based reward (AFP) is proposed to increase the reward value and reduce the impact of sparse rewards. Furthermore, a technique based on additional reward is proposed, which extracts the test occurrence frequency of passed test cases for additional rewards. Empirical studies are conducted on 14 real industry data sets. The experiment results are promising, especially the reward with additional reward can improve NAPFD (Normalized Average Percentage of Faults Detected) by up to 21.97%, enhance Recall with a maximum of 21.87%, and increase TTF (Test to Fail) by an average of 9.99 positions. “

“That will never work” – A book about Netflix

That Will Never Work: The Birth of Netflix by the first CEO and co-founder Marc Randolph : Randolph, Marc: Amazon.se: Böcker

Building a successful start-up seems like a really cool idea – from a distance. I used to teach a course about entrepreneurship, start-ups, business models, and the like. Although it was nice, I always felt that I’m a person who knows absolutely nothing about this. At least not in practice…

In this book, the original founder of Netflix tells the story of how he took the idea and made it into a product. He tells how the idea hatched and how he and his team created a data-driven model for understanding their customers. The book is also about the struggles of start-ups – about taking on investments from the beginning and then being pushed out of the company. It’s about being able to understand what’s best for the company and what’s best for the individual.

I like the way the author describes the story, and also shows a bit of himself: how he felt, how he wanted to build the company, and how he decided when to leave (with grace!). I also like the way he ends the book – Nobody knows anything! – a saying that captures how you never really know what will and will not work in the end.

I recommend this as a Sunday reading to get inspired.