Automating the Measurement of Heterogeneous Chatbot Designs (paper review)

Image by NPXL_Studio from Pixabay

Paper from: http://miso.es/pubs/ACMSAC_2022.pdf

Chatbots have gained importance in recent years, which has resulted in the development of several chatbot platforms (like Amazon Lex, Google DialogFlow or IBM Watson). However, there is a limited number of studies related to quality assurance of chatbots. The paper by Pablo C. Cañizares, Sara Pérez-Soler, Esther Guerra and Juan de Lara addresses just this problem – how to guide the testing of chatbots by using design metrics.

The paper proposes six global metrics (e.g., number of intents of the bot), eight intent metrics (e.g., number of training phrases per intent), three entity metrics (e.g., word length), and three flow metrics (e.g., conversation length). By measuring the values for these metrics, software designers of chatbots can predict three usability types – effectiveness, efficiency and satisfaction. To support the measurement process, the paper proposes a tool, available on GitHub, which can be used by practitioners. For some of the metrics, the tool employs machine learning and natural language processing. The metrics and the tool are evaluated on twelve chatbot designs. The tool could identify quality issues in terms of readability, conversation complexity, user experience and bot understanding. This demonstrates the usefulness of the tool in practice and how these metrics can help software developers in designing high-quality bots.

The metrics from the paper are:

  • INT – # intents
  • ENT – # user-defined entities
  • FLOW – # conversation entry points
  • PATH – # different conversation flow paths
  • CNF – # confusing phrases
  • SNT – # positive, neutral, negative output phrases
  • TPI – # training phrases per intent
  • WPTP – # words per training phrase
  • VPTP – # verbs per training phrase
  • PPTP – # parameters per training phrase
  • WPOP – # words per output phrase
  • VPOP – # verbs per output phrase
  • CPOP – # characters per output phrase
  • READ – reading time of the output phrases
  • LPE – # literals per entity
  • SPL – # synonyms per literal
  • WL – word length
  • FACT – # actions per flow
  • FPATH – # conversation flow paths
  • CL – conversation length
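
To get a feeling for how simple some of these metrics are to compute, here is a minimal sketch of three of them (INT, TPI and WPTP). It is not the tool from the paper: the dictionary-based bot design and the `design_metrics` function are my own assumptions for illustration, and TPI and WPTP are reported as averages over the whole bot.

```python
from statistics import mean

# Hypothetical, simplified bot design: intent name -> list of training phrases.
# Real platforms (Dialogflow, Lex, Watson) export much richer formats.
bot = {
    "order_pizza": ["I want a pizza", "order a large pizza please"],
    "opening_hours": ["when are you open"],
}


def design_metrics(intents):
    """Return INT, plus TPI and WPTP averaged over the whole bot."""
    phrases = [p for ps in intents.values() for p in ps]
    return {
        "INT": len(intents),                                # number of intents
        "TPI": mean(len(ps) for ps in intents.values()),    # training phrases per intent
        "WPTP": mean(len(p.split()) for p in phrases),      # words per training phrase
    }


metrics = design_metrics(bot)
print(metrics)
```

Splitting on whitespace is a crude way to count words; metrics such as VPTP or SNT would need a proper NLP library, which is where the tool from the paper comes in.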

I will try to use these metrics if I write a chatbot 🙂

What will shape the future of automotive software (engineering)?

Image by Jordan Holiday from Pixabay

Based on the following article + my own thoughts: D08042936-with-cover-page-v2.pdf (d1wqtxts1xzle7.cloudfront.net)

It’s been a while since I’ve written about automotive software, but that does not mean that nothing happened. During the pandemic, the car manufacturers suffered great losses caused by the global shortage of silicon, lack of workforce due to lockdowns and the overall slowdown of development due to the WFH situation.

There are a few trends that shape and will continue to shape the automotive sector. The first one is electrification – as the world is moving away from fossil fuels, more cars will need to use electricity. For the software part, this means that there will be fewer components to steer the powertrain, fewer communication buses, and lower complexity. This means that we have some spare computing power for more advanced functionality.

Now, this advanced functionality can come from autonomous driving, which is still an important trend. However, it may also come from increased connectivity and an increased number of smart functions (the ones using machine learning). The ability to develop software that utilizes this new power will decide whether a given car is popular or not. At the end of the day, consumers do not want to have boring cars with bare-minimum functionality. Cars are great, they need to be driven and their driving needs to be fun!

The last trend is the ability to utilize cooperative driving (which the article at the top tackles). To make things work smoothly, we need to coordinate. We can save fuel/energy if we calculate the exact time for one bus to arrive and the next one to leave – that requires coordination. The same goes for trucks, taxis, etc. This increased cooperative driving will also increase the complexity of the software and put higher requirements on its dependability, as one failure can propagate further than before.

Do explicit review strategies improve code review performance?

Image by Pixabay

Do explicit review strategies improve code review performance? Towards understanding the role of cognitive load | SpringerLink

I’ve written a lot about code reviews and I’ve done my share of experimentation in software engineering. When I started my career, using inspections (like Fagan-style code inspections) was the primary source of experimentation. It was how I learned to experiment, although I never experimented with code inspections.

So, when I got my hands on this article, I thought that it was just more of the same, but in a different context – whether guided reading actually improves the effectiveness and efficiency of code review processes. The effectiveness and efficiency are measured in the standard way – using defects as the output of the review process. But there is something new in this study.

First of all, this is a study done with professional developers. The authors have designed an experiment and employed professional, though junior, developers to conduct it. Second of all, this is an experiment in the context of modern code reviews (Git, Gerrit, that sort of thing). Third, the results are not that convincing any more.

I encourage you to read the entire paper, but let’s dive a bit deeper into some of the results. For example, the experiment found that guidance is not always better. It adds cognitive load (the reviewers need to understand the guidance as well as the code), and it can be downright misleading. It does pay off, though, for longer and more complex code fragments.

The experiment also found that the complexity of the actual guidance (checklist) plays an important role – shorter, less cognitively demanding lists are preferred. This is an important finding as, to the best of my knowledge, no one has ever said that. Checklists and perspective-based reading techniques assumed that more extra information equals better results. This experiment says that well-balanced information is better than more information. I know, it seems kind of obvious when you think about it, but it was not really considered up until now.

Finally, the most significant factor, found in this experiment, was that it is the understanding of the code that makes a review better or worse, not the guidance. At least not the guidance on a general level (like “Are all data types declared correctly?”).

What I take from that is that there is no substitute for knowledge. If you want to get something done, you need to put the hours into it.

I know, kids may not like it….

Understanding anomalies in software data

Image by pixabay

Identifying and classifying anomalies in software engineering data is a well-known field. Using ML to identify intrusion attacks, credit card fraud, or defects in production systems are just a few examples of how broad the field is. Wherever we have data, we can have anomalies.

In our project together with Sahlgrenska University Hospital and Chalmers AI Research Center (Chalmers AI Research Centre – Chair | Chalmers), anomalies come in two shapes. One type of anomaly is the set of disturbances in radio networks, such as rain or wind. The other type is a specific type of event during surgeries, such as clamping of the carotid artery.

Both types of anomalies have similarities, but also differences, which gives us an opportunity to study which anomaly detection algorithms work best. We tried both ML algorithms and domain-specific ones. Well, spoiler alert – not much has actually worked.

What works, on the other hand, is when we pivot on the problem. Instead of identifying anomalies, we can search for anomalies of a specific type. Instead of defining an anomaly as something deviating from the normal operations, we can say that we look for specific, though rare, events.
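
To make the pivot a bit more concrete, here is a toy sketch that contrasts the two views. It is not the code from our project: the signal values, the threshold and both function names are made up for illustration. The generic detector flags samples that deviate from the mean, while the targeted search looks for one specific event, a drop that is sustained for several samples.

```python
from statistics import mean, pstdev


def zscore_outliers(signal, limit=2.0):
    """Generic view: flag every sample that deviates strongly from the mean."""
    mu, sigma = mean(signal), pstdev(signal)
    return [i for i, x in enumerate(signal) if sigma and abs(x - mu) / sigma > limit]


def find_sustained_drops(signal, threshold, min_len):
    """Targeted view: find runs of at least min_len samples below threshold."""
    events, start = [], None
    for i, x in enumerate(signal + [threshold]):  # sentinel closes a trailing run
        if x < threshold and start is None:
            start = i                             # a run begins
        elif x >= threshold and start is not None:
            if i - start >= min_len:              # keep only long enough runs
                events.append((start, i - 1))
            start = None
    return events


# Hypothetical signal: one isolated dip (index 3) and one sustained drop (6..9).
signal = [10, 10, 10, 2, 10, 10, 4, 4, 4, 4, 10, 10]

print(zscore_outliers(signal, limit=1.5))
print(find_sustained_drops(signal, threshold=5, min_len=3))
```

On this toy signal the generic detector flags only the isolated dip at index 3, while the targeted search returns the sustained drop at indices 6 to 9, which is the event we actually care about.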

So far, we can identify anomalies pretty well, and we are working on getting better at classifying them automatically. So stay tuned if you would like to know more.

Review4Repair – article review

https://www.sciencedirect.com/science/article/pii/S0950584921002111?casa_token=D32qAALSwPQAAAAA:2Vm5zjUzFncLhQ_eZTUueefRqllb8fwBEnfWJfbNO_TYMcIjuumTBL9wWCLkscBko53FPR5cRQ

Reviewing source code is something that I talked about a few times already. It is an activity which is almost as old as software engineering itself. Back in the beginning, this activity was done manually just before the release, as a complement to testing.

Then came Google with their software engineering practices, something that is known today as “shift left”, meaning that software quality assurance activities should be done close to when the source code is actually developed. Then came Microsoft with their “Modern Code Reviews” that advocated code reviews before actually committing the source code to the main branch.

Now, we are pretty good at reviewing source code. It is an activity which is done rather fast. As the amount of data from code reviews grows, we are getting more eager to try to use AI and ML for this task. This article is a very good example of that. The authors leveraged seq2seq code summarization techniques to match source code with comments and then with code repair suggestions. The results are promising and show that they are able to provide a relevant suggestion in 1 out of 5 cases, or one out of three if we consider the top 10 suggestions.
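
The difference between the two numbers is the difference between top-1 and top-k accuracy. Here is a minimal sketch of how such a measure can be computed. The function and the toy data are my own illustration, not the evaluation code from the article, and I assume a suggestion counts as relevant only when it matches the reference fix exactly.

```python
def top_k_accuracy(ranked_suggestions, references, k):
    """Share of cases where the reference fix is among the first k suggestions."""
    hits = sum(ref in ranked[:k] for ranked, ref in zip(ranked_suggestions, references))
    return hits / len(references)


# Hypothetical model output: one ranked list of repair suggestions per review comment.
ranked_suggestions = [
    ["i < n", "i <= n", "i != n"],
    ["x = None", "x = 0", "x = []"],
    ["return a", "return b", "return a + b"],
]
# Hypothetical reference fixes made by the developers.
references = ["i < n", "x = []", "return None"]

top1 = top_k_accuracy(ranked_suggestions, references, k=1)
top3 = top_k_accuracy(ranked_suggestions, references, k=3)
print(top1, top3)
```

The larger the k, the more suggestions a reviewer has to read through, so a higher top-10 number is not free.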

I’ve read this article with huge interest and I will try this myself. All data and code are publicly available, which allows you to play around with this technology on your own system. Spoiler alert, though: training takes one week per pass on a TPU.

A Thousand Brains theory, book review

Image by Pixabay

A Thousand Brains: A New Theory of Intelligence : Hawkins, Jeff: Amazon.se: Böcker

This is a new book, written by one of the people behind the Palm Pilot, who is both an engineer and a neuroscientist. The book proposes a theory on how to describe the neocortex and its functions.

Why is the neocortex so important, one may ask. It holds our intelligence and our consciousness. Some would say that it is the place which defines us as humans, which allows us to be aware and intelligent.

The interesting part of this book is the fact that it attempts to provide guidelines on the future of machine intelligence and machine learning. It shows different paths to achieve AGI (Artificial General Intelligence): either as developing a lot of specialized models and winding them together, or as making one large model for everything.

These two approaches are already present in the modern AI community. The latter one (large model for everything) can be seen in the work of OpenAI and GPT-3. The scientists behind that model train it on large corpora of text, hoping that it can understand our natural language and execute our commands. Well, for now it is mostly about creating programs.

The first approach is generally the original idea of AI and ML. The original idea is about training models for specific tasks, such as image recognition, classification, text translation. This is where most of the current research lies and where we have observed the latest breakthroughs – AlphaGo, AlphaStar.

However, the thesis in the book is that the approach of one large model is more natural, similar to how our brain works. The theory of how the neocortex stores, frames and recalls information is the core of what we need to achieve in order to make it work in practice.

Well, there is more to the book than I can write in a blog post, so I strongly recommend reading it and reflecting on how we use ML and AI today. I’m going to try a few of these ideas in 2022!

Death interrupted by Jose Saramago, book review

Only for those who love Terry Pratchett….

Every now and then I read what the Nobel prize winners in literature write. This year, I read this novel by the 1998 winner.

The novel is quite interesting and very well written. If you like Terry Pratchett, then you will find it very good. The synopsis is that one day, Death decides to take a break and not visit England. It means that no one in England dies that day, and for a while onwards.

At first, this situation brings a lot of joy – people do not die in car accidents, from diseases, etc. After a while, however, this situation brings a few challenges.

First, the challenges for healthcare – there is not enough room in hospitals anymore, as the very sick patients stay in the wards forever.

Then, the economic ones – the funeral entrepreneurs are going out of business.

Finally, the mafia comes in and starts smuggling people to neighboring countries, where they instantly die. For a small fee, naturally.

Now, in the middle of the book, Death returns, but changes the way she works. Instead of using the scythe to take lives, she sends letters. The letters, nicely put in a violet envelope, have the same effect – the recipient dies.

Again, this sounds like a great idea, but depends on the post office (of course) and the timing of sending the letters, as well as the timing for receiving the letters.

The book ends with Death coming down to earth dressed as a woman, as she has fallen for a cellist. She watches him perform, visits him at his home, and keeps his pet on her lap.

I’m not going to spoil the book for you, so here is where I stop.

If you like the style of Terry Pratchett, this book is for you!

Legacy code…

I stumbled across a great talk from Dylan Beattie about legacy code. It is a pre-pandemic talk, but it opens up with a great song and talks about legacy code differently than what we usually do.

There is a lot of great material and food for thought in this video, but I would like to turn your attention to minute 26, where Dylan talks about Excel and how the world runs on it.

He says that a lot of things are actually built on top of Excel because it is essentially a functional language of sorts. The software developed on top of Excel is also the software that is NOT written by professional programmers and software engineers. Yet, it is prevalent in modern society.

Don’t get me wrong. I am in favor of Excel. Love the tool and what Microsoft has done with it. It is so flexible that it can be used with almost all programming environments – from the built-in VBA (I know, ancient history), to Python or C#. We’ve done our share of Excel programming back in the day, e.g. designed measurement systems based on it: A framework for developing measurement systems and its industrial evaluation – ScienceDirect

I agree, the tool is not perfect, but it is installed on ALL office computers and can be executed by anybody. Just open up the file and run it. That’s why we chose it for the measurement systems. Well, at least until we had to do a big rewrite and go to SQL, dashboards, etc…

As I said – history.

Predicting defects on the line level, article review

Image by pixabay

IEEE Xplore Full-Text PDF:

A lot has been written about defect prediction, and I’m pretty sure that a lot will be written. It’s one of the research areas which is quite cool to work with because it provides researchers with quite quick results and is relatively quantitative in its nature.

One could also say that this is a holy grail in software development – to predict a location of a defect and fix it before it becomes a problem. It’s a good goal, but it is also a goal that is more like quicksand than a gravel road. Well, for one, not all defects are easy to recognize. Some are not even certain to be defects – sometimes it is not clear how to interpret a requirement, so it’s not easy to say if a piece of code is implementing it correctly or not.

In this paper, the authors have done a great job in creating a system to predict defect location at the line level – DeepLineDP. The requirements for the system are partially based on a survey that the authors conducted with developers.

According to the authors: “DeepLineDP is 14%-24% more accurate than other file-level defect prediction approaches; is 50%-250% more cost-effective than other line-level defect prediction approaches; and achieves a reasonable performance when transferred to other software projects. These findings confirm that the surrounding tokens and surrounding lines should be considered to identify the fine-grained locations of defective files (i.e., defective lines).”

I like this work and I recommend it to everyone interested in how to use deep learning for code tasks.

Our team has done some of these investigations ourselves. You can watch them on YouTube here:


Test prioritization – a systematic review (review)

Image source: pixabay

Test case selection and prioritization using machine learning: a systematic literature review (springer.com)

Testing is an important activity in every software engineering project. In professional organizations, the process is structured and well-organized. In smaller projects, start-up style organizations, or in research studies, the process is less organized.

There are different views on why we do testing. Some think that we do testing to find defects, some to prove that the software works correctly, and some think that we do it to waste time (well, maybe not so many). In my experience, it is a combination of the first and the second. We do testing to find defects and also to track how good our software gets over time (software reliability growth modelling).

This paper presents a systematic literature review on using machine learning to select and prioritize test cases. I think that the authors summarize their contribution in a very good way (quote):

  • The main ML techniques used for TSP are: supervised learning (ranking models), unsupervised learning (clustering), reinforcement learning, and natural language processing.
  • ML-based TSP techniques mainly rely on features that are easy to compute and based on data that are practical to collect in a CI context, including execution history, coverage information, code complexity, and textual data.
  • ML-based TSP techniques are evaluated using a variety of metrics that are, sometimes, calculated differently in TS and TP, making it difficult to compare their results. Most of the currently available subjects have extremely low failure rates, making them unsuitable for evaluating ML-based TSP techniques.
  • Comparing the performance of ML-based TSP techniques is challenging due to the variation of evaluation metrics, test suite sizes, and failure rates across studies. Reporting failure rates alongside performance values helps provide more interpretable results to the wider research community.
  • Only six out of the 29 selected studies (21%) can be considered reproducible, thus raising methodological issues in the studies and a lack of confidence in reported results.
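
To illustrate what “features based on execution history” can look like in the simplest case, here is a small sketch that orders tests by their past failure rate and also reports the overall failure rate, as the review recommends. This is a plain heuristic baseline, not one of the ML techniques from the review, and the test names and history data are hypothetical.

```python
# Hypothetical execution history: test name -> outcomes of past CI runs (True = failed).
history = {
    "test_login": [False, True, True, False],
    "test_export": [False, False, False, False],
    "test_search": [True, False, False, False],
}


def failure_rate(outcomes):
    """Fraction of runs that failed; 0.0 when there is no history yet."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def prioritize(history):
    """Order tests by past failure rate, highest first; ties are broken by name."""
    return sorted(history, key=lambda t: (-failure_rate(history[t]), t))


order = prioritize(history)
overall = failure_rate([o for runs in history.values() for o in runs])
print(order, overall)
```

An ML-based approach would replace the sort key with the score of a trained ranking model, using the failure rate as just one feature among coverage, complexity and textual data.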

I think the biggest surprise, for me, is that complexity-based metrics are still used widely in this context. I’m happy that there are new approaches on the rise, for example textual analyses. I guess there is a point in combining approaches, but complexity seems like a very coarse-grained instrument for this type of analysis. We know it correlates well with size, and the larger the test (or UUT), the higher the probability of triggering a failure.

Well, I guess I need to run more experiments myself to check if I am missing something.