SE metrics (Software Engineering) – Page 7 – Software engineering, metrics, functional safety …

Testing deep neural networks (article highlight)

A Probabilistic Framework for Mutation Testing in Deep Neural Networks (arxiv.org)

Testing of neural networks is still an open problem. Due to the complexity of their connections, and their probabilistic nature, it is difficult to find defects. Although there is a lot of approaches, e.g., using autoencoders or using surprise adequacy measures, testing of neural networks is something of a mistery for me.

I could say that the topic was under my radar for a while. I actually though that there is not much need for testing research in software engineering; even though I run two projects with the testing components. For one, I thought that deep learning is basically like a “rabbit hole” – the more you test it, the more interesting properties you discover. I’ve tried to use testing to understand what kind of things the models learn, but I’m not sure that this is the right approach. I’m affraid that this will never be the case – the deep learning models learn something, we can evaluate it, but we can never really fully understand what the models has learned.

Now, this article uses mutation testing for the purpose to find the best test suite to validate the models. Well, it does more than that. It offers a framework where we can use three different models to evaluate the mutants and choose the ones that are expected to provide the best results. It is built on top of frameworks/models like DeepCrime (link here) and can provide a better selection approach. So far, the framework has been evaluated on the standard dataset – MNIST – but I hope that it will be expanded on other datasets in the future.

Testing deep learning systems in automotive software – article highlight

IEEE Xplore Full-Text PDF:

The summer is in full swing, and after a few weeks of leisure and relaxation, I’m back to work. In one of our research projects, we examine the ability to test deep learning systems for computer vision in autonomous drive systems. It’s been a challenge, as the field is rather scattered. There is a lot of work on testing DL systems but without the specifics of the safety or autonomous drive. At the same time, there are a lot of studies about testing autonomous systems – usually using simulations.

So, in this paper, the authors focus on using metamorphic testing to test DL networks. By manipulating the input images, they observe how the network reacts and what the predicted behavior is. This helps to establish some sort of boundaries regarding when the system is safe to operate and how it can behave in practice. It allows an understanding of which neurons were essentially activated in the network (which is not the same as network coverage).

The paper presents a tool for that purpose, which is something that I really need to try on our autoencoders from the DeVELOP project.

Language models in Software Engineering (new paper review)

Articla available at: https://arxiv.org/pdf/2205.11739.pdf

It’s no secret that I’ve been fascinated by modern, BERT-like language models. I’ve seen what they can do and how they operate, use them in two of my research projects. So, when this paper came around, I read it directly.

It’s a paper which makes an overview of what kind of tasks the language models are used in software engineering today. The list is long and contains a variety of tasks, e.g., code-to-code retrieval, repairing of source code or bug finding/fixing. In total a lot of these tasks, but, IMHO, a bit low-level tasks. There are no tasks that attempt to understand code at the design-level, for example whether we can really see specific design in the code.

The paper also shows which models are used, and provides references to these models. They list 20 models, with the tasks for which they were trained, including the datasets that they were trained on. Fantastic!

I need to dive deeper into these models, but I’m super happy about the fact that there is a list of these models now and that the language technology makes a significant body of work in software engineering now.

Automating the Measurement of Heterogeneous Chatbot Designs (paper review)

Paper from: http://miso.es/pubs/ACMSAC_2022.pdf

Using chatbots has gained importance in recent years, which has resulted in development of several chatbot platforms (like Amazon Lex, Google DialogFlow or IBM Watson). However, there is a limited number of studies related to quality assurance of chatbots. The paper from Pablo C. Cañizares, Sara Pérez-Soler, Esther Guerra and Juan de Lara addresses just this problem – how to guide testing of chatbots by using design metrics.

The paper proposes six global metrics (e.g., number of intents of the bot), eight intent metrics (e.g., number of training phrases per intent), three entity metrics (e.g., word length), and three flow metrics (e.g., conversation length). By measuring the values for these metrics, software designers of chatbots can predict three usability types – effectiveness, efficiency and satisfaction. To support the measurement process, the paper proposes a tool, available on GitHub, which can be used by practitioners. For some of the metrics, the tool employs machine learning and natural language processing. The metrics and the tool are evaluated on twelve chatbot designs. The tool could identify quality issues in terms of readability, conversation complexity, user experience and bot understanding. This demonstrates the usefulness of the tool in practice and how these metrics can help software developers in designing high-quality bots.

The metrics from the paper are:

INT – # intents
ENT – # user-defined entities
FLOW – # conversation entry points
PATH – # different conversation flow paths
CNF – # confusing phrases
SNT – # positive, neutral, negative output phrases
TPI – # training phrases per intent
WPTP – # words per training phrase
VPTP – # verbs per training phrase
PPTP – # parameters per training phrase
WPOP – # words per output phrase
VPOP – # verbs per output phrase
CPOP – # characters per output phrase
READ – reading time of the output phrases
LPE – # literals per entity
SPL – # synonyms per literal
WL – word length
FACT – # actions per flow
FPATH – # conversation flow paths
CL – conversation length

I will try to use these metrics if I write chatbot 🙂

What will shape the future of automotive software (engineering)?

Based on the following article + my own thoughts: D08042936-with-cover-page-v2.pdf (d1wqtxts1xzle7.cloudfront.net)

It’s been a while since I’ve written about automotive software, but that does not mean that nothing happened. During the pandemic, the car manufacturers suffered great losses caused by the global shortage of silicon, lack of workforce due to lockdowns and the overall slowdown of development due to the WFH situation.

There are a few trends that shape and will continue to shape the automotive sector. The first one is electrification – as the world is going away from fossil fuels, more cars will need to use electricity. For the software part, this means that there will be fewer components to steer the powertrain, fewer communication buses, and lower complexity. This means that we have some spare computing power for more advanced functionality.

Now, this advanced functionality can come from autonomous driving, which is still an important trend. However, it may also come from increased connectivity and an increased number of smart functions (the ones using machine learning). The increased ability to develop software that utilizes this new power will decide whether a given car is popular or not. By the end of the day, the consumers do not want to have boring cars with bare-minimum functionality. Cars are great, they need to be driven and their driving needs to be fun!

The last trend is the ability to utilize cooperative driving (which the article at the top tackles). To make things work smoothly, we need to coordinate. We can save fuel/energy if we calculate the exact time for one bus to arrive and the next one to leave – that requires coordination. The same goes for trucks, taxis, etc. This increased cooperative driving will also increase the complexity of software and put more requirements on the dependability of it – as one failure can propagate longer than before.

Do explicit review strategies improve code review performance?

Do explicit review strategies improve code review performance? Towards understanding the role of cognitive load | SpringerLink

I’ve written a lot about code reviews and I’ve done my share of experimentation in software engineering. When I started my career, using inspections (like Fagan-style code inspections) was the primary source of experimentation. It was how I learned to experiment, although I never experimented with code inspections.

So, when getting my hands on this article, I thought that this is just one of the same, but in a different context – whether guided reading actually improves effectiveness and efficiency of code review processes. The effectiveness and efficiency are measured in the standard way – using defects as the output of the review process. But, there is something new with this study.

First of all, this is a study done with professional developers. The authors have designed an experiment and employed professional, though junior, developers to conduct it. Second of all, this is an experiment in the context of modern code reviews (Git, Gerrit, that sort of thing). Third, the results are not that convincing any more.

I encourage you to read the entire paper, but let’s dive a bit deeper into some of the results. For example, the experiment found that it is not always the case that guidance is better. It provides more cognitive load (the reviewers need to understand the guidance as well as the code), and it can be downright misleading. It pays off for longer and more complex code fragments.

The experiment also found that the complexity of the actual guidance (checklist) plays an important role – shorter, less cognitively demanding lists, are preferred. This is an important finding as, to my best knowledge, no one has ever said that. Checklists and perspective-based reading techniques assumed that more extra information equals better results. This experiment says that a well-balanced information is better than more information. I know, seems kind of obvious when you think of that, but it was not really considered up until now.

Finally, the most significant factor, found in this experiment, was that it is the understanding of the code that makes a review better or worse, not the guidance. At least not the guidance on a general level (like “Are all data types declared correctly?”).

What I make out of that is that there is nothing that substitutes knowledge. If you want to get something done, you need to put the hours into this.

I know, kids may not like it….

Understanding anomalies in software data

Identifying and classifying anomalies in software engineering data is a well-known field. Using ML to identify intrusion attacks, credit card frauds, defects in production systems – are just a few of the examples of how broad the field is. Wherever we have data, we can have anomalies.

Both types of anomalies have similarities, but also differences, which provides us with an opportunity to study which of the algorithms for anomaly detection work best. We tried both the ML algorithms and domain-specific ones. Well, spoiler alert – not much has actually worked.

In our project together with Sahlgrenska University Hospital and Chalmers AI Research Center ( Chalmers AI Research Centre – Chair | Chalmers ), anomalies come in two shapes. One type of anomaly is the set of disturbances in radio networks, such as rain or wind. The other type is a specific type of event during surgeries, such as clamping of the carotid artery.

What works, on the other hand, is when we pivot on the problem. Instead of identifying anomalies, we can search for anomalies of a specific type. Instead of defining an anomaly as something deviating from the normal operations, we can say that we look for specific, though rare, events.

So far, we can identify anomalies pretty well and we work on being better to classify them automatically. So stay tuned if you would like to know more.

Testing of ML systems

BIld av OpenClipart-Vectors från Pixabay

Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

Machine learning systems are very popular today, at least when it comes to research applications. They are not as popular as one would wished (or liked) in the real applications. One of the reasons is the fact that they are hard to test. We do not know how to check if an algorithm will behave as expected in all similar situations – well, we do not know which situations are similar for us and for the ML system.

This paper looks at the problem from a different angle. The research question is: RQ: What are simple and generic software tests that are capable of finding bugs and improving the quality of machine learning algorithms?

The authors developed a set of smoke tests, which they see that all ML algorithms should pass. The paper is quite exhaustive and if you are interested, I recommend to take a look at this table:

Table 1 | Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

I love the article. It is simple, to the point and very applied. I’m going to use that in my tests of ML algorithms in the future.

How good are language models for source code tasks?

https://ieeexplore-ieee-org.ezproxy.ub.gu.se/document/9653849

Using machine learning, and deep learning in particular, for software engineering tasks exploded recently. I would say that it exploded a bit too much. I’m myself to blame here as our team was one of the early adopters with the CCFlex model and source code analysis.

Well, this paper compares a number of modern deep learning models, so called transformers, in various code and comment analysis tasks. The authors did a great job in collecting a set of models and datasets, trained them and critically evaluated the performance.

I recommend reading the entire paper, but what they found was a bit surprised for me. First of all, they found that the transformer models are better for the natural language and not so great for the source code analysis. The hypothesis is that the structure of programs is important here. They have also found that pre-training is important, but not crucial. Pre-training attributes to a moderate effect in the end. The dataset, and its content, is much more important for the task at hand.

This is a great paper and I hope that this can become an essential reading for software engineers working with AI systems engineering supporting the software engineering tasks.

Review4Repair – article review

https://www.sciencedirect.com/science/article/pii/S0950584921002111?casa_token=D32qAALSwPQAAAAA:2Vm5zjUzFncLhQ_eZTUueefRqllb8fwBEnfWJfbNO_TYMcIjuumTBL9wWCLkscBko53FPR5cRQ

Reviewing source code is something that I talked about a few times already. It is an activity which is almost as old as software engineering itself. Back in the beginning, this activity was done manually just before the release, as a complement to testing.

Then came google with their software engineering practices, something that is known today as “shift left”, meaning that software quality assurance activities should be done close to when the source code is actually developed. Then came Microsoft with their “Modern Code Reviews” that advocated code reviews before actually committing the source code to the main branch.

Now, we are pretty good at reviewing source code. It is an activity which is done rather fast. As the amount of data from code reviews grows, we are getting more eager to try to use AI and ML for this task. This article is a very good example of that. The authors leveraged on the ability to use seq2seq code summarization techniques to match source code with comments and then with code repair suggestions. The results are promising and show that they are able to provide a relevant suggestion in 1 out of 5 cases. One out of three if we consider top 10 suggestions.

I’ve read this article with huge interest and I will try this myself. All data and code is available publicly, which allows to play around with this technology on your own system. Spoiler alert, though, that training takes one week per pass on a TPU.