metrics – SE metrics (Software Engineering)

How to improve the code reviews

https://dl.acm.org/doi/pdf/10.1145/3660806

I have written a lot about code reviews. Not because this is my favourite activity, but because I think that there is a need for improvement there. Reading others’ code is not as fun as we may think and therefore making it a bit more interesting is much desirable.

This paper caught my attention because of the practical focus of it. In particular, the abstract caught my attention – they claim that changing the order of files presented for review makes a lot of difference. Up to 23% more comments are written when the files are arranged in the right order. Not only that, the quality of the comments seems to increase too. More tips:

Re-order Files Based on Hot-Spot Prediction: The study found that reordering the files changed by a patch to prioritize hot-spots—files that are likely to require comments or revisions—improves the quality of code reviews. Implementing a system that automatically reorders files based on predicted hot-spots could make the review process more efficient, as it leads to more targeted comments and a better focus on critical areas.
Focus on Size-Based Features: The study highlighted that size-based features (like the number of lines added or removed) are the most important when predicting review activities. Emphasizing these features when prioritizing files or creating models for review could further streamline the process.
Utilize Large Language Models (LLMs): LLMs, such as those used for encoding text, have shown potential in capturing the essence of code changes more effectively than simpler models like Bag-of-Words. Incorporating LLMs into the review tools could improve the detection of complex or nuanced issues in the code.
Automate Hot-Spot Detection and Highlighting: The positive impact of automatically identifying and prioritizing hot-spots suggests that integrating such automation into existing code review tools could significantly enhance the efficiency and effectiveness of the review process.

Sounds like this is one of the examples where we can see the large benefits of using LLMs in code reviews. I hope that this will make it into more companies than Ubisoft (partner on that paper).

I’ve asked ChatGPT to provide me an example of how to create such a hotspot model and it seems that this can be implemented in practice very easily. I will not paste it here, but please try for yourself.

Automating the Measurement of Heterogeneous Chatbot Designs (paper review)

Paper from: http://miso.es/pubs/ACMSAC_2022.pdf

Using chatbots has gained importance in recent years, which has resulted in development of several chatbot platforms (like Amazon Lex, Google DialogFlow or IBM Watson). However, there is a limited number of studies related to quality assurance of chatbots. The paper from Pablo C. Cañizares, Sara Pérez-Soler, Esther Guerra and Juan de Lara addresses just this problem – how to guide testing of chatbots by using design metrics.

The paper proposes six global metrics (e.g., number of intents of the bot), eight intent metrics (e.g., number of training phrases per intent), three entity metrics (e.g., word length), and three flow metrics (e.g., conversation length). By measuring the values for these metrics, software designers of chatbots can predict three usability types – effectiveness, efficiency and satisfaction. To support the measurement process, the paper proposes a tool, available on GitHub, which can be used by practitioners. For some of the metrics, the tool employs machine learning and natural language processing. The metrics and the tool are evaluated on twelve chatbot designs. The tool could identify quality issues in terms of readability, conversation complexity, user experience and bot understanding. This demonstrates the usefulness of the tool in practice and how these metrics can help software developers in designing high-quality bots.

The metrics from the paper are:

INT – # intents
ENT – # user-defined entities
FLOW – # conversation entry points
PATH – # different conversation flow paths
CNF – # confusing phrases
SNT – # positive, neutral, negative output phrases
TPI – # training phrases per intent
WPTP – # words per training phrase
VPTP – # verbs per training phrase
PPTP – # parameters per training phrase
WPOP – # words per output phrase
VPOP – # verbs per output phrase
CPOP – # characters per output phrase
READ – reading time of the output phrases
LPE – # literals per entity
SPL – # synonyms per literal
WL – word length
FACT – # actions per flow
FPATH – # conversation flow paths
CL – conversation length

I will try to use these metrics if I write chatbot 🙂

Legacy code…

I stumbled across a great talk from Dylan Beattie about legacy code. It is a pre-pandemic talk, but it opens up with a great song and talks about legacy code differently than what we usually do.

There is a lot of great material and food for thought in this video, but I would like to turn your attention to minute 26, where Dylan talks about Excel and how the world runs on it.

He says that a lot of things are actually built on top of Excel because it is essentially a functional language of sorts. The software developed on top of Excel is also the software that is NOT written by professional programmers and software engineers. Yet, it is prevalent in modern society.

Don’t get me wrong. I am in favor of Excel. Love the tool and what Microsoft has done with it. It is so flexible that it can be used with almost all programming environments – from the built-in VBA (I know, ancient history), to Python or C#. We’ve done our share of Excel programming back in a day, e.g. designed measurement systems based on it: A framework for developing measurement systems and its industrial evaluation – ScienceDirect

I agree, the tool is not perfect, but it is installed on ALL office computers and can be executed by anybody. Just open up the file and run it. That’s why we chose it for the measurement systems. Well, at least until we had to do a big rewrite and go to SQL, dashboards, etc…

As I said – history.

Can we predict quality of service level quality based on metrics?

link to paper: https://doi.org/10.1016/j.infsof.2020.106313

Image by 3D Animation Production Company from Pixabay

Understanding how your product performs in the field is a very hot topic. To be honest, it’s always been. When we design and develop our product, we often want to know whether the outcome is going to be good or not. Well, this is not an easy task because we cannot really know to runtime properties of our products before very close to the final version. It’s difficult to measure end-to-end response time if we only have half of the database, or if we cannot simulate the full load of the product.

In this paper, the authors analysed over 700 web services and measured their code quality and interface quality as predictors for the quality of service.

From the abstract: source code and interface quality metrics/antipatterns are correlated with web service quality attributes (response time, availability, throughput, successability, reliability, compliance, best practices, latency, and documentation).

The internal code/interface quality measures were:

NPT: Number of port types, Interface
NOPT: Average number of operations in port types, Interface
NBS: Number of services, Interface
NIPT: Number of identical port types, Interface
NIOP: Number of identical operations, Interface
ALPS: Average length of port-types signature, Interface
AMTO: Average meaningful terms in operation names, Interface
AMTM: Average meaningful terms in message names, Interface
AMTMP: Average meaningful terms in message parts, Interface
AMTP: Average meaningful terms in port-type names, Interface
NOD: Number of operations declared, Code
NAOD: Number of accessor operations declared, Code
ANIPO: Average number of input parameters in operations, Code
ANOPO: Average number of output parameters in operations, Code
NOM: Number of messages, Code
NBE: Number of elements of the schemas, Code
NCT: Number of complex types, Code
NST: Number of primitive types, Code
NBB: Number of bindings, Code
NPM: Number of parts per message, Code
COH Cohesion: The degree of the functional relatedness of the operations of the service, Code
COU Coupling: A measure of the extent to which inter-dependencies exist between the service modules, Code
ALOS: Average length of operations signature, Code
ALMS: Average length of message signature, Code

This is quite a collection of measures and they are quite interesting, e.g. the average meaningful terms in port-type names. I must admit that it’s a new measure that I’ve not seen before.

The measures of the quality of service were:

Response Time: Time taken to send a request and receive a response, QoS
Availability: How often is the service available for consumption, QoS
Throughput: Total Number of invocations for a given period of time, QoS
Successability: Number of response / number of request messages, QoS
Reliability: Ratio of the number of error messages to total messages, QoS
Compliance: The extent to which a WSDL document follows WSDL specification, QoS
Best Practices: The extent to which a web service follows WS-I Basic Profile, QoS
Latency: Time taken for the server to process a given request, QoS
Documentation: Measure of documentation (i.e. description tags) in WSDL, QoS

Some of these are also quite interesting, e.g. Successability, which is the number of response messages per request messages.

The authors also measured some anti patterns of service design, which they list in the paper. I will not go through these anti-patterns, but I think that they are also correlated with some of the metrics.

I suggest this reading to everyone interested in web services and service design. Perhaps this could help us to get more high-performance services. I hope that I can see more of this type of research in the domain of service security – that’s an interesting area in itself.

Your code and AI – more than precision and recall!

Using machine learning and AI to improve your coding is an important area of research. Together with colleagues we work with these techniques, to take them from open source to more industry quality.

There are two great tools that one can use today already. One of the tools is a beta version of add-in for visual studio, which helps software engineers to write code.

https://www.microsoft.com/en-us/ai/ai-lab-code-defect

Microsoft is very active in this area and even has release a set of tools that support the development of AI systems: https://www.microsoft.com/en-us/research/project/visual-studio-code-tools-ai/

Also:

https://techcommunity.microsoft.com/t5/educator-developer-blog/visual-studio-code-tools-for-ai-extension/ba-p/379420

What is great is that the tools are, naturally, available freely!

Another tool is a DeepCode, which analyzes software code and provides suggestions to improve it – e.g. use a specific design pattern or refactoring.

https://www.deepcode.ai/

This is great that we have increasingly more tools and that AI engineering matures. We do not want to have precision and recall steer our development. We want to have real testing and real systems. We also need to work with data quality in order to ensure that the systems are reliable.

The alternative is that we use MCC, precision, recall, F1-score to tell us how good a system is, which is not entirely true. These measures do not provide any view on how the system reflects the requirements put on it. These measures allow us to compare different classifiers, but not systems.

I hope that we can focus more discussion on AI quality and not classification quality/accuracy.

Actionable software metrics – an interesting new article

https://dl.acm.org/doi/pdf/10.1145/3383219.3383244

Working with metrics is a domain which calls for empirical data, which constantly changes. Software companies evolve and their metric programs evolve. I’ve always been interested how the metrics data is used in companies, especially in other geographical regions than the nordics. Although there are differences between companies in Sweden, Danmark or Finland, these companies are still more similar to each other than companies within the same domain in other parts of the world – and that’s perfectly fine.

In this paper, the authors captured my attention because they studied a few companies that were not on my radar before. The author have also found an interesting angle on the metrics work – what makes a metric actionable?

As it turns out, there are a few things that make the metrics actionable:

being practical
inform decision-making, and
exhibit data quality

These three characteristics are very important and agreed upon by most of the respondents. The authors also recognise the need of the metrics to be temporal, i.e. relevant for the information needs at hand.

What I also liked about the paper is that they provide the link to their data, which is the set of metrics used by the studied companies – a very interesting list: https://doi.org/10.5281/zenodo.3580893

However, what is interesting, and contradictory to what we have observed in our work, is the fact that the metrics which are focused on specific projects/products or universal, were not that popular. Instead, the metrics should be applicable for multiple products/projects, i.e. the type of the measured entity should contain more than one instance.

So, are the non-actionable metrics a complete waste of time then? Well, I would not say so. Neither do the authors. The non-actionable metrics can still be informative. They can be used to raise awareness of the issue, or simply provide the means for monitoring of the situation or the product, without the need to trigger specific decisions. Examples – product sales numbers, customer satisfaction, etc. Hard to act on them directly, but very important to collect and monitor.

Emerging new field – DUO – Data mining and optimization (EMSE article)

New sub-areas or fields within software engineering are not that common, but they come up once in a while. The authors of this article (https://doi-org.ezproxy.ub.gu.se/10.1007/s10664-020-09808-9, Better software analytics via “DUO”: Data mining algorithms using/used-by optimizers) argue that this is the case now.

In this article, the authors provide a view that data mining and building optimization models are done in tandem and that this is the new field. They show that the data mined from repositories influences optimization models and that the development of models influences data mining.

The authors make the following claims (quoted from the paper, references removed):

Claim1:For software engineering tasks, optimization and data mining are very similar. Hence, it is natural and simple to combine the two methods.
Claim2:For software engineering tasks. optimizers can greatly improve data miners. A data miner’s default tuners can lead to sub-optimal performance. Automatic optimizers can find tunings that dramatically improve that performance.
Claim3:For software engineering tasks, data miners can greatly improve optimization. If a data miner groups together related items, an optimizer can explore and report conclusions that are general across a set of solutions. Further, optimization for SE problems can be very slow. But if that optimization executes over the groupings found by a data miner, that inference can terminate orders of magnitude faster.
Claim4:For software engineering tasks, data mining without optimization is not recommended. Conclusions reached from an unoptimized data miner can be changed, sometimes even dramatically improved, by running the same tuned learner on the same data. Researchers in data mining should, therefore, consider adding an optimization step to their analysis.

These claims make a lot of sense and they are aligned with my observations. I recommend this article for everyone who is working at or developing a metric team or a data analysis/data science team.

Testing ML applications

Image by Gordon Johnson from Pixabay

Code: https://github.com/lawrence415610/Mtkeras

I’ve recently looked at the applications of different testing techniques for testing ML applications and got interested in the so called metamorphic testing. The idea is that we can check whether an output is within a specific range or set, which is called a metamorphic relation ( https://medium.com/trustableai/testing-ai-with-metamorphic-testing-61d690001f5c).

What is interesting about this paper is that it presents a framework for testing ML applications. I’ve not tried it yet, but I will as it seems very interesting to check how things work with this metamorphic testing and metamorphic relations. I’ve also interested in how to measure the quality of the software in this context.

https://www.researchgate.net/profile/Zhi_Quan_Zhou/publication/340487456_A_Testing_Tool_for_Machine_Learning_Applications/links/5e8c7ee14585150839c682b9/A-Testing-Tool-for-Machine-Learning-Applications.pdf

Succeeding with large scale measurement programs

This week we had the possibility to give a webinar about how to work with large scale measurement programs. The webinar was dedicated for everyone who works with software metrics and would like to get more impact from that work.

It is not so much about the numbers, it is about the impact and what the numbers mean. The webinar that we present, provides a good understanding of how to make this impact. Based on our experiences, we chose all one needs to know to implement a measurement program in few weeks rather than years.

The webinar has been recorded and is available at this link: https://www.youtube.com/watch?v=2ChaVT_3djE&feature=youtu.be

Recording from the webinar about how to succeed with measurement programs.

Engineers and scientists love to measure. We measure complexity of software, its performance, size and maintainability (just to name a few). We need these measurements in order to construct software, manager organizations or release high quality, high reliable products. However, there is a difference between measuring software aspects and using the measures in decision processes. In this talk, we present the concept of measurement program, measurement system, information quality and indicator-triggered decisions. We show what to consider when setting up measurement programs and provide a hints about the costs and benefits of having the program. We end the talk with presenting recent research results from Software Center, where we combine measurements and machine learning to speed-up software development.

More materials about this are available here:

Book describing our experiences: https://www.springer.com/gp/book/9783319918358
PhD course about software measurement: https://www.springer.com/gp/book/9783319918358
Our website: www.metrics.se
Website of our collaboration: www.software-center.se

A while back we gave a webinar with a similar title, where we focused on the questions concerning the measurement infrastructure, visualization and assessment of the measurement program. The ACM webinar is presented here: