Automating the Measurement of Heterogeneous Chatbot Designs (paper review)

Image by NPXL_Studio from Pixabay

Paper from: http://miso.es/pubs/ACMSAC_2022.pdf

Chatbots have gained importance in recent years, which has resulted in the development of several chatbot platforms (such as Amazon Lex, Google DialogFlow or IBM Watson). However, there are only a limited number of studies on the quality assurance of chatbots. The paper by Pablo C. Cañizares, Sara Pérez-Soler, Esther Guerra and Juan de Lara addresses exactly this problem – how to guide the testing of chatbots using design metrics.

The paper proposes six global metrics (e.g., the number of intents of the bot), eight intent metrics (e.g., the number of training phrases per intent), three entity metrics (e.g., word length) and three flow metrics (e.g., conversation length). By measuring these metrics, chatbot designers can predict three usability aspects – effectiveness, efficiency and satisfaction. To support the measurement process, the paper provides a tool, available on GitHub, which practitioners can use. For some of the metrics, the tool employs machine learning and natural language processing. The metrics and the tool are evaluated on twelve chatbot designs, where the tool was able to identify quality issues in terms of readability, conversation complexity, user experience and bot understanding. This demonstrates the usefulness of the tool in practice and shows how these metrics can help software developers design high-quality bots.

The metrics from the paper are:

  • INT – # intents
  • ENT – # user-defined entities
  • FLOW – # conversation entry points
  • PATH – # different conversation flow paths
  • CNF – # confusing phrases
  • SNT – # positive, neutral, negative output phrases
  • TPI – # training phrases per intent
  • WPTP – # words per training phrase
  • VPTP – # verbs per training phrase
  • PPTP – # parameters per training phrase
  • WPOP – # words per output phrase
  • VPOP – # verbs per output phrase
  • CPOP – # characters per output phrase
  • READ – reading time of the output phrases
  • LPE – # literals per entity
  • SPL – # synonyms per literal
  • WL – word length
  • FACT – # actions per flow
  • FPATH – # conversation flow paths
  • CL – conversation length
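
The authors' tool is on GitHub; just to make a few of the metrics concrete, here is a minimal sketch of my own (not the authors' implementation), assuming a simplified, hypothetical design structure loosely resembling a DialogFlow-style export:

```python
from statistics import mean

# Hypothetical, simplified chatbot design (not the paper's input format)
design = {
    "intents": {
        "order_pizza": {
            "training_phrases": ["I want a pizza", "order a large pepperoni pizza"],
            "output_phrases": ["Sure, which size would you like?"],
        },
        "greet": {
            "training_phrases": ["hi", "hello there"],
            "output_phrases": ["Hi! How can I help you?"],
        },
    },
    "entities": {
        # entity -> literal -> synonyms
        "size": {"small": ["S", "tiny"], "large": ["L", "big"]},
    },
}

INT = len(design["intents"])    # INT - number of intents
ENT = len(design["entities"])   # ENT - number of user-defined entities
TPI = mean(len(i["training_phrases"]) for i in design["intents"].values())
WPTP = mean(                    # WPTP - words per training phrase
    len(phrase.split())
    for i in design["intents"].values()
    for phrase in i["training_phrases"]
)
LPE = mean(len(literals) for literals in design["entities"].values())
SPL = mean(                     # SPL - synonyms per literal
    len(synonyms)
    for literals in design["entities"].values()
    for synonyms in literals.values()
)

print(f"INT={INT} ENT={ENT} TPI={TPI} WPTP={WPTP} LPE={LPE} SPL={SPL}")
```

Running this on a real project would of course require parsing the platform-specific export format – which is exactly the heterogeneity that the paper's tool takes care of.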

I will try to use these metrics if I ever write a chatbot 🙂

What will shape the future of automotive software (engineering)?

Image by Jordan Holiday from Pixabay

Based on the following article + my own thoughts: D08042936-with-cover-page-v2.pdf (d1wqtxts1xzle7.cloudfront.net)

It’s been a while since I’ve written about automotive software, but that does not mean that nothing has happened. During the pandemic, car manufacturers suffered great losses caused by the global semiconductor shortage, a lack of workforce due to lockdowns and an overall slowdown of development due to the WFH situation.

There are a few trends that shape, and will continue to shape, the automotive sector. The first one is electrification – as the world moves away from fossil fuels, more cars will need to run on electricity. For the software part, this means fewer components to steer the powertrain, fewer communication buses and lower complexity. It also means that we get some spare computing power for more advanced functionality.

Now, this advanced functionality can come from autonomous driving, which is still an important trend. However, it may also come from increased connectivity and a growing number of smart functions (the ones using machine learning). The ability to develop software that utilizes this new computing power will decide whether a given car is popular or not. At the end of the day, consumers do not want boring cars with bare-minimum functionality. Cars are great, they need to be driven, and driving them needs to be fun!

The last trend is the ability to utilize cooperative driving (which the article linked above tackles). To make things work smoothly, we need to coordinate. We can save fuel/energy if we calculate the exact time for one bus to arrive and the next one to leave – that requires coordination. The same goes for trucks, taxis, etc. Increased cooperative driving will also increase the complexity of the software and put more requirements on its dependability, as one failure can propagate further than before.

Do explicit review strategies improve code review performance?

Image by Pixabay

Do explicit review strategies improve code review performance? Towards understanding the role of cognitive load | SpringerLink

I’ve written a lot about code reviews, and I’ve done my share of experimentation in software engineering. When I started my career, inspections (like Fagan-style code inspections) were the primary subject of experimentation. That is how I learned to experiment, although I never experimented with code inspections myself.

So, when I got my hands on this article, I thought it would be just more of the same, only in a different context – whether guided reading actually improves the effectiveness and efficiency of the code review process. Effectiveness and efficiency are measured in the standard way – using defects as the output of the review process. But there is something new in this study.
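
In review experiments, these two measures are typically operationalized roughly like this (my paraphrase of the common definitions, not necessarily the exact formulas used in this paper):

```python
# Common operationalization in code review / inspection experiments
# (an assumption on my part, not taken from the paper itself)
def effectiveness(defects_found: int, defects_total: int) -> float:
    """Share of the known (seeded or confirmed) defects that the reviewer found."""
    return defects_found / defects_total

def efficiency(defects_found: int, review_minutes: float) -> float:
    """Defects found per hour of review effort."""
    return defects_found / (review_minutes / 60.0)

print(effectiveness(6, 10))   # 0.6 -> 60% of the defects were detected
print(efficiency(6, 45))      # 8.0 -> eight defects per review hour
```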

First of all, this is a study done with professional developers. The authors have designed an experiment and employed professional, though junior, developers to conduct it. Second of all, this is an experiment in the context of modern code reviews (Git, Gerrit, that sort of thing). Third, the results are not that convincing any more.

I encourage you to read the entire paper, but let’s dive a bit deeper into some of the results. For example, the experiment found that guidance is not always better. It adds cognitive load (the reviewers need to understand the guidance as well as the code), and it can even be downright misleading. It pays off mainly for longer and more complex code fragments.

The experiment also found that the complexity of the actual guidance (the checklist) plays an important role – shorter, less cognitively demanding lists are preferred. This is an important finding because, to the best of my knowledge, no one has said that before. Checklists and perspective-based reading techniques assumed that more extra information equals better results. This experiment suggests that well-balanced information is better than more information. I know, it seems kind of obvious when you think about it, but it was not really considered until now.

Finally, the most significant factor found in this experiment was that it is the understanding of the code that makes a review better or worse, not the guidance. At least not guidance on a general level (like “Are all data types declared correctly?”).

What I make of this is that nothing substitutes for knowledge. If you want to get something done, you need to put in the hours.

I know, kids may not like it….