Succeeding with large scale measurement programs

This week we had the possibility to give a webinar about how to work with large scale measurement programs. The webinar was dedicated for everyone who works with software metrics and would like to get more impact from that work.

It is not so much about the numbers, it is about the impact and what the numbers mean. The webinar that we present, provides a good understanding of how to make this impact. Based on our experiences, we chose all one needs to know to implement a measurement program in few weeks rather than years.

The webinar has been recorded and is available at this link: https://www.youtube.com/watch?v=2ChaVT_3djE&feature=youtu.be

Recording from the webinar about how to succeed with measurement programs.

Engineers and scientists love to measure. We measure complexity of software, its performance, size and maintainability (just to name a few). We need these measurements in order to construct software, manager organizations or release high quality, high reliable products. However, there is a difference between measuring software aspects and using the measures in decision processes. In this talk, we present the concept of measurement program, measurement system, information quality and indicator-triggered decisions. We show what to consider when setting up measurement programs and provide a hints about the costs and benefits of having the program. We end the talk with presenting recent research results from Software Center, where we combine measurements and machine learning to speed-up software development.

More materials about this are available here:

A while back we gave a webinar with a similar title, where we focused on the questions concerning the measurement infrastructure, visualization and assessment of the measurement program. The ACM webinar is presented here:

Classifying code smells…

https://link-springer-com.ezproxy.ub.gu.se/article/10.1007%2Fs11219-020-09498-y

Image by Comfreak from Pixabay

Code smells are quite interesting phenomena to study. They are not really defects, but they are not good code either. They exist, but people rarely want to admit to them. There is also no consensus to how much effort it takes to remove them (or even whether they should be removed or just avoided).

In this paper, the authors study whether it is possible to use ML to find code smells. It turns out it is possible and the accuracy is quite high (over 95%). It also shows that sometimes it is better to show a number of recommendations (e.g. two potential smells) rather than one – it requires less accuracy to make the recommendation, but helps the users to narrow-down their solution spaces.

How do we know if something is popular…

Investigating diversity and impact of the popularity metrics for ranking software packages (review): https://onlinelibrary-wiley-com.ezproxy.ub.gu.se/doi/pdfdirect/10.1002/smr.2265

Image from Pixabay

I’ve written about the ways of assessing how good software is. One of the modern approaches, which I talked about before, is the use of A/B testing and online experiments. Providing the users with different versions of the features/systems/use cases allows the company to understand which of the options provides the best response from the users.

However, there are a number of challenges with this approach – the most prominent being the potential existence of confounding factors. Even if the results show a positive/negative response, we do not really know whether the response is not caused by something else (for example by users being tired, changes in the environment, etc.)

After using GitHub, both as a user and as a researcher, I sometimes wondered whether the star system is actually the right one. I wondered whether we should use a sort-of A/B testing system where we could check how often people usually access certain repositories.

In this paper, the authors take a look at different ways of assessing popularity of repositories. The results show that regardless of the metrics, the popular repositories are popular – i.e. popularity is not dependent of a metric.

Popularity metrics studied:

  • Total number of downloads of the package
  • Number of projects dependent on the package
  • Number of repositories dependent on the package
  • Source rank of the package
  • Number of forks
  • Number of watchers
  • Number of contributors
  • Number of stars
  • Number of open issues
  • Total number of tags

The actual analysis is quite interesting, so I recommend to take a look at the paper directly.

Using machine learning to understand the quality of requirements

Image by Hans Braxmeier from Pixabay

https://link-springer-com.ezproxy.ub.gu.se/article/10.1007%2Fs11219-020-09511-4

Working with software requirements and metrics is an important part of research in modern software companies. Although many of the companies are Agile or post-Agile, claiming that they do not have requirements, they still capture user needs in textual forms. For example, they describe user stories, epic, use cases.

This paper is an interesting view on the software requirements quality assessment. Instead of just calculating metrics and creating quality models, they use machine learning to mimic the way in which experts judge what is a good requirement and what is not. They use quality functions, and several of them, to distinguish between the good and bad requirements. Using multiple functions, in a multidimensional space, allows to select groups of requirements that are separated by the other class – the figures in the paper show more how this works in practice.

The summary of the gist of the paper is actually presented best in the introduction (quote): “Summing up, we can compute a set of quantitative metrics of textual requirements, and through them, we can assess the quality of requirements. However, the risk of this approach is to build assessment methods and tools that are both arbitrary in the parameterization of metrics and rigid in the combination of metrics to evaluate the different properties. This is why we propose in this work to develop a flexible assessment method that can be adapted to different contexts, with a high degree of automation. The method consists basically in the emulation of the experts’ judgment on quality through artificial intelligence techniques: first, obtain the expert’s implicit quality function through machine learning, and, second, apply this function to automatically assess the quality of textual requirements.

Our approach to emulate the experts’ judgment, as explained later in detail, is based on well-known machine learning techniques: we have a computer tool learn from a previous human-made classification of requirements according to their quality. Therefore, our work’s intent is not to improve machine learning techniques, but rather to devise a novel application to the field of requirements quality assessment.”

I strongly recommend to read the paper as it provides very good methods to work with requirements quality in many modern organisations.

Quality of Deep Learning code – article review

A (deep) Staircase in Vatican, Image by JEROME CLARYSSE from Pixabay

http://swat.polymtl.ca/~foutsekh/docs/hadhemi-MSR2020.pdf

Deep learning models are often designed, trained and tested in Python. It is a language with a nice structure, quite straigtforward syntax and a lot of libraries. However, very few tutorials about deep learning (or any Python programming tutorials) discuss the quality of the code, e.g. its modularization, encapsulation, naming consistency.

As a result, a lot of code for machine learning, written in Python, often is hard to read and hard to grasp. Even if used as part of jupyter notebooks, the code is not really commented (often).

The study behind the link above is a study that supports my long gut feeling about this. The findings show that (from the abstract): First, long lambda expression, long ternary conditional expression, and complex container comprehension smells are frequently found in deep learning projects. That is, deep learning code involves more complex or longer expressions than the traditional code does. Second, the number of code smells increases across the releases of deep learning applications. Third, we found that there is a co-existence between code smells and software bugs in the studied deep learning code, which confirms our conjecture on the degraded code quality of deep learning applications.

The second finding, about the constant increase of the number of code smells, is similar to the studies we did in proprietary software about complexity – the complexity “never” decreases ( http://web.student.chalmers.se/~vard/files/Monitoring%20Complexity%20Evolution.pdf ).

The study compares 59 deep learning systems with 59 non-ML systems from GitHub. One could argue that the sample is not representative (no propprietary systems), but it is a fair sample.

To sum up, a very nice reading, showing that we need to think about quality, not only models, but also code quality.

Evidence of improvement using Agile…

Towards the end of the year I’d like to make a small reflection on Agile software development. It’s been discussed for a number of years now, yet the evidence of bringing measurable results is rather scarce. Here is one article from Åby Academy in Finland which studies a transformation of a large company to Agile: https://www.researchgate.net/profile/Marta_Olszewska_Plaska/publication/280711876_Did_it_actually_go_this_well_a_Large-Scale_Case_Study_on_an_Agile_Transformation/links/55c1d7ea08aeb28645819d3f.pdf

Studied case: Ericsson

Size: ca. 350 people

Product: roughly 10 years old

Languages: RoseRT, C++, Java

Summary of results: Agile software development provided more features (5x) and faster (60%).

What I like about the paper is that it provides the measurement before the transformation, DURING the transformation and after. Very interesting reading!

Do SysML requirement diagrams help?

Today I’ve had a privilege to present a paper at EASE 2014 done in collaboration with University of Basilicata in Italy.

Link to presentation

The paper is an experimental validation of whether requirement diagrams speed up the understanding of requirement specifications or whether they increase/decrease comprehension. The results show that the comprehension is increased while there is no change in time.

Evolution of Long-Term Industrial – new paper

Darko Durisic has done an interesting work on the evolution of industrial-class meta-models. The work has been accepted as full paper at SEAA (Software Engineering for Advanced Applications) Euromicro Conference.

Title: Evolution of Long-Term Industrial Meta-Models – A Case Study

Abstract: Meta-models in software engineering are used to define properties of models. Therefore the evolution of the metamodels influences the evolution of the models and the software instantiated from them. The evolution of the meta-models is particularly problematic if the software has to instantiate two versions of the same meta-model – a situation common for longterm software development projects such as car development projects. In this paper, we present a case study of the evolution of the standardized meta-model used in the development of the automotive software systems – the AUTOSAR meta-model – at Volvo Car Corporation. The objective of this study is to assist the automotive software designers in planning long term development projects based on multiple AUTOSAR meta-model versions. We achieve this by visualizing the size and complexity increase between different versions of the AUTOSAR meta-model and by calculating the number of changes which need to be implemented in order to adopt a newer AUTOSAR meta-model version. The analysis is done for each major role in the Automotive development process affected by the changes.

Stay tuned for the full version of the paper and congrats to Darko!

Choosing reliability growth models…

In our recent research we’ve looked at a number of ways on how to support software development companies in working with reliability modelling.

I have come across this article on how to choose a model – a sys rev. They authors look at a number of criteria and evaluate which ones are the most used in choosing models. Nice and interesting reading.

Link to full text

Article highlight: Measures and external quality…

Article highlight: Empirical evidence on the link between object-oriented measures and external quality attributes: a systematic literature review
Link to full text

This article presents an interesting systematic review where the authors set off to look for evidence of correlation between OO metrics and quality. What I like about this paper:

  • nice overview of which metrics suites exist for OO programs
  • nice overview which external quality metrics are used
  • essentially only 99 studies exist which have the right scope and quality
  • the number of studies seem to be growing – even in the past 2-3 years
  • the 20-year old CK suite is still the most popular one

Recommended reading for those who want to see which metrics are the best predictors, when and why.