Succeeding with large scale measurement programs

This week we had the opportunity to give a webinar about how to work with large scale measurement programs. The webinar was intended for everyone who works with software metrics and would like to get more impact from that work.

It is not so much about the numbers; it is about the impact and what the numbers mean. The webinar provides a good understanding of how to make this impact. Based on our experiences, we selected what one needs to know to implement a measurement program in a few weeks rather than years.

The webinar has been recorded and is available at this link: https://www.youtube.com/watch?v=2ChaVT_3djE&feature=youtu.be

Recording from the webinar about how to succeed with measurement programs.

Engineers and scientists love to measure. We measure the complexity of software, its performance, size and maintainability (just to name a few). We need these measurements in order to construct software, manage organizations or release high-quality, highly reliable products. However, there is a difference between measuring software aspects and using the measures in decision processes. In this talk, we present the concepts of measurement program, measurement system, information quality and indicator-triggered decisions. We show what to consider when setting up measurement programs and provide hints about the costs and benefits of having the program. We end the talk by presenting recent research results from Software Center, where we combine measurements and machine learning to speed up software development.

More materials about this are available here:

A while back we gave a webinar with a similar title, where we focused on the questions concerning the measurement infrastructure, visualization and assessment of the measurement program. The ACM webinar is presented here:

How do we know if something is popular…

Investigating diversity and impact of the popularity metrics for ranking software packages (review): https://onlinelibrary-wiley-com.ezproxy.ub.gu.se/doi/pdfdirect/10.1002/smr.2265

Image from Pixabay

I’ve written about the ways of assessing how good software is. One of the modern approaches, which I talked about before, is the use of A/B testing and online experiments. Providing the users with different versions of the features/systems/use cases allows the company to understand which of the options provides the best response from the users.

However, there are a number of challenges with this approach, the most prominent being the potential existence of confounding factors. Even if the results show a positive or negative response, we do not really know whether the response is caused by something else (for example by users being tired, changes in the environment, etc.).

After using GitHub, both as a user and as a researcher, I sometimes wondered whether the star system is actually the right one. I wondered whether we should use a sort-of A/B testing system where we could check how often people usually access certain repositories.

In this paper, the authors take a look at different ways of assessing the popularity of repositories. The results show that regardless of the metric, the popular repositories are popular, i.e. popularity does not depend on the choice of metric.

Popularity metrics studied:

  • Total number of downloads of the package
  • Number of projects dependent on the package
  • Number of repositories dependent on the package
  • Source rank of the package
  • Number of forks
  • Number of watchers
  • Number of contributors
  • Number of stars
  • Number of open issues
  • Total number of tags
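One way to check whether different popularity metrics agree is to compare the rankings they produce. The sketch below is only an illustration and not the analysis from the paper: it computes the Spearman rank correlation between pairs of metrics, and the package names and metric values are invented for the example.

```python
from itertools import combinations

# Hypothetical packages with made-up metric values (NOT data from the paper).
packages = {
    "pkg_a": {"stars": 5200, "forks": 800, "downloads": 910000},
    "pkg_b": {"stars": 3100, "forks": 450, "downloads": 400000},
    "pkg_c": {"stars": 900, "forks": 120, "downloads": 150000},
    "pkg_d": {"stars": 150, "forks": 30, "downloads": 20000},
    "pkg_e": {"stars": 40, "forks": 35, "downloads": 9000},
}


def ranks(values):
    """Return 1-based ranks (1 = largest value); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def metric_agreement(packages):
    """Spearman correlation for every pair of metrics across all packages."""
    names = sorted(packages)
    metrics = sorted(next(iter(packages.values())))
    return {
        (m1, m2): spearman(
            [packages[n][m1] for n in names],
            [packages[n][m2] for n in names],
        )
        for m1, m2 in combinations(metrics, 2)
    }


if __name__ == "__main__":
    for pair, rho in metric_agreement(packages).items():
        print(pair, round(rho, 2))
```

Values close to 1 mean that two metrics rank the packages in nearly the same order, which is the kind of agreement the post describes.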

The actual analysis is quite interesting, so I recommend taking a look at the paper directly.

Using machine learning to understand the quality of requirements

Image by Hans Braxmeier from Pixabay

https://link-springer-com.ezproxy.ub.gu.se/article/10.1007%2Fs11219-020-09511-4

Working with software requirements and metrics is an important part of research in modern software companies. Although many of the companies are Agile or post-Agile, claiming that they do not have requirements, they still capture user needs in textual forms. For example, they describe user stories, epics and use cases.

This paper offers an interesting view on software requirements quality assessment. Instead of just calculating metrics and creating quality models, the authors use machine learning to mimic the way in which experts judge what is a good requirement and what is not. They use several quality functions to distinguish between good and bad requirements. Using multiple functions, in a multidimensional space, makes it possible to select groups of requirements that are separated by the other class; the figures in the paper show how this works in practice.

The summary of the gist of the paper is actually presented best in the introduction (quote): “Summing up, we can compute a set of quantitative metrics of textual requirements, and through them, we can assess the quality of requirements. However, the risk of this approach is to build assessment methods and tools that are both arbitrary in the parameterization of metrics and rigid in the combination of metrics to evaluate the different properties. This is why we propose in this work to develop a flexible assessment method that can be adapted to different contexts, with a high degree of automation. The method consists basically in the emulation of the experts’ judgment on quality through artificial intelligence techniques: first, obtain the expert’s implicit quality function through machine learning, and, second, apply this function to automatically assess the quality of textual requirements.

Our approach to emulate the experts’ judgment, as explained later in detail, is based on well-known machine learning techniques: we have a computer tool learn from a previous human-made classification of requirements according to their quality. Therefore, our work’s intent is not to improve machine learning techniques, but rather to devise a novel application to the field of requirements quality assessment.”
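The idea in the quote, learning the experts' implicit quality function from requirements they have already classified, can be sketched in a few lines. This is a toy illustration under loud assumptions: the three text metrics, the word lists, the labelled requirements and the nearest-neighbour classifier are all invented here and are not the metrics or the learning technique used in the paper.

```python
import re

# Hypothetical word lists; a real tool would use a richer set of metrics.
VAGUE_WORDS = {"fast", "easy", "user-friendly", "appropriate", "etc", "some", "several"}
CONJUNCTIONS = {"and", "or"}

# Hypothetical requirements with labels that stand in for an expert's judgment.
EXPERT_LABELLED = [
    ("The system shall respond to a login request within 2 seconds.", "good"),
    ("The system shall export the report as a PDF file.", "good"),
    ("The system shall log every failed login attempt.", "good"),
    ("The system should be fast and easy to use.", "bad"),
    ("The user interface shall be user-friendly and appropriate for some users etc.", "bad"),
    ("The system shall handle several formats or protocols and be easy to extend.", "bad"),
]


def metrics(requirement):
    """Quantitative metrics of one requirement: (words, vague words, conjunctions)."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", requirement.lower())
    return (
        len(words),
        sum(word in VAGUE_WORDS for word in words),
        sum(word in CONJUNCTIONS for word in words),
    )


def distance(a, b):
    """Squared Euclidean distance between two metric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(labelled):
    """'Learning' for a nearest-neighbour model is storing the labelled vectors."""
    return [(metrics(text), label) for text, label in labelled]


def assess(model, requirement):
    """Give a new requirement the label of its closest labelled neighbour."""
    features = metrics(requirement)
    _, label = min(model, key=lambda item: distance(item[0], features))
    return label


model = train(EXPERT_LABELLED)

if __name__ == "__main__":
    print(assess(model, "The system shall be fast and user-friendly."))
    print(assess(model, "The system shall store each order within 1 second."))
```

The point of the sketch is the division of labour: the metrics are computed automatically, while the decision about which combinations of metric values count as good quality comes from the expert-labelled examples rather than from hand-written thresholds.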

I strongly recommend reading the paper, as it provides very good methods for working with requirements quality in many modern organisations.

Evidence of improvement using Agile…

Towards the end of the year I’d like to make a small reflection on Agile software development. It’s been discussed for a number of years now, yet evidence that it brings measurable results is rather scarce. Here is one article from Åbo Akademi University in Finland which studies the transformation of a large company to Agile: https://www.researchgate.net/profile/Marta_Olszewska_Plaska/publication/280711876_Did_it_actually_go_this_well_a_Large-Scale_Case_Study_on_an_Agile_Transformation/links/55c1d7ea08aeb28645819d3f.pdf

Studied case: Ericsson

Size: ca. 350 people

Product: roughly 10 years old

Languages: RoseRT, C++, Java

Summary of results: Agile software development delivered more features (5x) and delivered them faster (60%).

What I like about the paper is that it provides the measurement before the transformation, DURING the transformation and after. Very interesting reading!

Measurement-as-a-Service (MaaS)

In recent years we’ve seen a lot of discussions and good things about cloud computing: sharing platforms (PaaS) and software (SaaS), thus optimizing the usage of computing resources.

This sharing of resources is important for making the software sustainable, and helps the companies to focus on what their business is about rather than on their IT infrastructure.

Measurement programs are no different: they are often of strategic value for companies, but they are not really something the companies want to spend their R&D budget on (at least not directly). So, how do we make it happen?

Well, we could use the same approach as in SaaS and PaaS and define MaaS (Measurement-as-a-Service), where we can reuse the knowledge across organizations and minimize the cost of working with software measurement initiatives.

We’ve tried this concept with one of our industrial partners – Ericsson – and it seems that it works very well. You can read more about it in this article.

And the picture below explains a bit how this works.

How to choose the right dashboard?

Dashboards and all kinds of radiators are very popular in industry now. They allow the companies to disseminate the metrics information and to find the right way of visualizing the metrics.

In a recent article written together with Ericsson and Volvo Cars, we explored how to find the right visualization and developed a model for choosing the dashboard: http://gup.ub.gu.se/records/fulltext/220504/220504.pdf.

The method quantifies a number of dimensions of a good dashboard and provides a simple set of sliders that can be used to select the right visualization. The companies in the study have found it to be a good input for understanding what the stakeholders want when they say “dashboard”.

As a next step, we’re currently working on defining a quality model of KPIs (Key Performance Indicators). The first version has shown that it allows the companies to reduce the number of indicators by as much as 90% by finding the ones which are not of good quality.


How robust is a measurement program?


In our recent work we have explored the possibility of validating that a measurement program is robust. We have worked with seven companies within Software Center to establish a method and evaluate it. The results are presented in a newly accepted paper, “MeSRAM – A Method for Assessing Robustness of Measurement Programs in Large Software Development Organizations and Its Industrial Evaluation”, to appear in the Journal of Systems and Software.

In short, the method is based on collecting evidence that a measurement program contains the elements which are important for the program to be able to handle changes. Examples are whether a measurement program has a dedicated organization working with it and whether the entire company is able to utilize the results from the measurement program.

The method is similar to the stress-testing of banks, so popular in the last decade.

The next step in our research is finding out which metrics the companies should use to assure the long-term robustness of the measurement program. Stay tuned!


Which metrics are used in Agile and Lean software development?

When working with companies in different projects I am often asked which metrics an Agile software development team should use. The answer is of course: it depends on what your team does… and then a set of questions from my side follows. These questions are designed to help me understand the activities of the team, the activities downstream in the process, the product, the process, etc.

I’ve recently looked into this article, where the authors review the metrics used in agile teams. Although I had high hopes for them, I got a bit disappointed: they were more or less the same metrics as any other team uses.

Review article: http://www.sciencedirect.com/science/article/pii/S095058491500035X

However, metrics like release readiness (see our previous article from Ericsson) were not found.

I guess I need to search on…

Staron, Miroslaw, Wilhelm Meding, and Klas Palm. “Release Readiness Indicator for Mature Agile and Lean Software Development Projects.” Agile Processes in Software Engineering and Extreme Programming. Springer Berlin Heidelberg, 2012. 93-107.

Improving defect prediction – article highlight

Predicting defects has been on my mind for a while and I’ve been collecting evidence of good metrics which can improve accuracy of predictions.

In this article Madeyski and Jureczko have found one more metric, Number of Distinct Committers (NDC), which seems to improve prediction models. The link to the full article is here: Which Process Metrics Can Significantly Improve Defect Prediction Models? An Empirical Study.

The empirical evaluation includes 27 open source projects and 6 industry projects. It’s great that there is an increased body of evidence combining both open source and industrial projects, especially as the results seem to be consistent.
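As a rough illustration of what a metric like NDC counts, the sketch below derives the number of distinct committers per file from a simplified commit log. The commit data, the file names and the choice to count per file over the whole log are assumptions made for this example; the article may define the unit of analysis and the time window differently.

```python
from collections import defaultdict

# Hypothetical, simplified commit log: (author, files touched in the commit).
commits = [
    ("alice", ["core/parser.py", "core/lexer.py"]),
    ("bob", ["core/parser.py"]),
    ("alice", ["core/parser.py"]),
    ("carol", ["core/parser.py", "ui/view.py"]),
    ("bob", ["ui/view.py"]),
]


def distinct_committers(commits):
    """Count how many different authors have changed each file."""
    authors_per_file = defaultdict(set)
    for author, files in commits:
        for path in files:
            # A set keeps each author once, however many commits they made.
            authors_per_file[path].add(author)
    return {path: len(authors) for path, authors in authors_per_file.items()}


if __name__ == "__main__":
    for path, ndc in sorted(distinct_committers(commits).items()):
        print(path, ndc)
```

A value computed this way could then be added as one more input column to a defect prediction model, next to the product metrics already in use.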