PHANTOM – finding well-engineered software projects, fast…


I’ve worked with two great students – Peter and Joshua – who wanted to do something interesting. They developed a tool that replicates a study by other researchers, but faster and with less data. We also managed to team up with Mirek from Poznan, who improved the classification algorithm and asked his colleagues for new, industrial data.

And this is the outcome – a tool that can connect to a git repository and recognise whether your project is well-engineered or not. It helps companies understand whether their teams are working in a structured manner or ad hoc.

The tool makes it possible to assess whether a specific repository is in need of maintenance or not.


Context: Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets.

Objective: The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way.

Method: This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm.

Results: Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify “engineered” projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies.

Conclusions: It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.
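
To make the method a bit more tangible, here is a minimal sketch of a PHANTOM-like pipeline: counting commits per week from a Git log, turning each series into a feature vector, and clustering the vectors with k-means. This is my own illustration, not the authors’ code – the paper extracts five measures and represents the time-series itself as the feature vector, whereas this sketch summarises each series with a few statistics for brevity; the repository paths are placeholders.

```python
# A minimal, illustrative sketch of a PHANTOM-like pipeline (not the authors' code).
# Assumes GitPython and scikit-learn are installed; repository paths are placeholders.
from collections import Counter
from datetime import datetime

import numpy as np
from git import Repo                      # pip install GitPython
from sklearn.cluster import KMeans

def weekly_commit_series(repo_path: str) -> np.ndarray:
    """Count commits per ISO week from the Git log of one repository."""
    weeks = Counter()
    for commit in Repo(repo_path).iter_commits():
        week = datetime.fromtimestamp(commit.committed_date).strftime("%Y-%W")
        weeks[week] += 1
    return np.array([weeks[w] for w in sorted(weeks)])  # chronological vector

def to_feature_vector(series: np.ndarray) -> np.ndarray:
    """Summarise a time-series with a few simple statistics (for brevity here)."""
    return np.array([
        series.mean(),        # average weekly activity
        series.std(),         # burstiness of the activity
        series.max(),         # peak activity
        (series > 0).mean(),  # share of weeks with any activity
    ])

if __name__ == "__main__":
    repo_paths = ["/tmp/project-a", "/tmp/project-b", "/tmp/project-c"]  # placeholders
    X = np.vstack([to_feature_vector(weekly_commit_series(p)) for p in repo_paths])
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    for path, label in zip(repo_paths, labels):
        print(path, "-> cluster", label)
```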

Can we predict quality of service based on metrics?

link to paper:


Understanding how your product performs in the field is a very hot topic. To be honest, it always has been. When we design and develop a product, we often want to know whether the outcome is going to be good or not. This is not an easy task, because we cannot really know the runtime properties of our products until very close to the final version. It’s difficult to measure end-to-end response time if we only have half of the database, or if we cannot simulate the full load on the product.

In this paper, the authors analysed over 700 web services and measured their code quality and interface quality as predictors for the quality of service.

From the abstract: source code and interface quality metrics/antipatterns are correlated with web service quality attributes (response time, availability, throughput, successability, reliability, compliance, best practices, latency, and documentation).

The internal code/interface quality measures were:

  • NPT: Number of port types, Interface
  • NOPT: Average number of operations in port types, Interface
  • NBS: Number of services, Interface
  • NIPT: Number of identical port types, Interface
  • NIOP: Number of identical operations, Interface
  • ALPS: Average length of port-types signature, Interface
  • AMTO: Average meaningful terms in operation names, Interface
  • AMTM: Average meaningful terms in message names, Interface
  • AMTMP: Average meaningful terms in message parts, Interface
  • AMTP: Average meaningful terms in port-type names, Interface
  • NOD: Number of operations declared, Code
  • NAOD: Number of accessor operations declared, Code
  • ANIPO: Average number of input parameters in operations, Code
  • ANOPO: Average number of output parameters in operations, Code
  • NOM: Number of messages, Code
  • NBE: Number of elements of the schemas, Code
  • NCT: Number of complex types, Code
  • NST: Number of primitive types, Code
  • NBB: Number of bindings, Code
  • NPM: Number of parts per message, Code
  • COH: Cohesion – the degree of the functional relatedness of the operations of the service, Code
  • COU: Coupling – a measure of the extent to which inter-dependencies exist between the service modules, Code
  • ALOS: Average length of operations signature, Code
  • ALMS: Average length of message signature, Code

This is quite a collection of measures, and some of them are genuinely interesting, e.g. the average number of meaningful terms in port-type names. I must admit that it’s a measure I’ve not seen before.

The measures of the quality of service were:

  • Response Time: Time taken to send a request and receive a response, QoS
  • Availability: How often is the service available for consumption, QoS
  • Throughput: Total Number of invocations for a given period of time, QoS
  • Successability: Number of response / number of request messages, QoS
  • Reliability: Ratio of the number of error messages to total messages, QoS
  • Compliance: The extent to which a WSDL document follows WSDL specification, QoS
  • Best Practices: The extent to which a web service follows WS-I Basic Profile, QoS
  • Latency: Time taken for the server to process a given request, QoS
  • Documentation: Measure of documentation (i.e. description tags) in WSDL, QoS

Some of these are also quite interesting, e.g. Successability, which is the number of response messages per request message.
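
Since the heart of the study is correlating these internal metrics with the QoS attributes, a rough sketch of that kind of analysis could look as follows. The CSV file and column names are hypothetical placeholders, and the paper’s actual statistical procedure may differ (e.g. in the choice of correlation coefficient).

```python
# Illustrative only: correlate interface/code metrics with QoS attributes.
# 'web_services.csv' and its column names are hypothetical placeholders.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("web_services.csv")

metric_cols = ["NPT", "NOPT", "NOD", "COH", "COU"]            # a subset, for brevity
qos_cols = ["response_time", "availability", "throughput", "latency"]

for m in metric_cols:
    for q in qos_cols:
        rho, p = spearmanr(df[m], df[q])
        flag = "*" if p < 0.05 else " "
        print(f"{m:5s} vs {q:15s} rho={rho:+.2f} p={p:.3f}{flag}")
```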

The authors also measured some anti-patterns of service design, which they list in the paper. I will not go through these anti-patterns here, but I think that they are also correlated with some of the metrics.

I suggest this reading to everyone interested in web services and service design. Perhaps this could help us to get more high-performance services. I hope that I can see more of this type of research in the domain of service security – that’s an interesting area in itself.

Your code and AI – more than precision and recall!


Using machine learning and AI to improve your coding is an important area of research. Together with colleagues, I work with these techniques to take them from open source to industrial quality.

There are two great tools that one can use already today. One of them is a beta version of an add-in for Visual Studio, which helps software engineers to write code.

Microsoft is very active in this area and has even released a set of tools that support the development of AI systems:


What is great is that the tools are, naturally, freely available!

Another tool is DeepCode, which analyzes software code and provides suggestions to improve it – e.g. to use a specific design pattern or refactoring.

It is great that we get more and more tools and that AI engineering matures. We do not want precision and recall to steer our development. We want real testing and real systems. We also need to work with data quality to ensure that the systems are reliable.

The alternative is to use MCC, precision, recall and the F1-score to tell us how good a system is, which does not tell the whole story. These measures do not provide any view of how well the system reflects the requirements put on it. They allow us to compare different classifiers, but not systems.
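
For readers who want to see exactly what these scores capture, here is a small sketch that computes them with scikit-learn on made-up predictions. None of the numbers say anything about data quality or whether the classifier satisfies the requirements of the surrounding system – which is precisely the point above.

```python
# What precision, recall, F1 and MCC measure - and nothing more (illustrative data).
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # ground truth (made-up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # classifier output (made-up)

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
# All four numbers compare predictions with labels on a fixed dataset;
# they say nothing about data quality, requirements coverage or system behaviour.
```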

I hope that we can focus more discussion on AI quality and not classification quality/accuracy.

Engineering AI systems – differences to engineering “other” software systems…


Having been a software engineer working with AI for a while, I have noticed that the engineering of AI systems is different. Well, maybe not building the actual system, but the way in which knowledge about quality, testing and maintenance differs.

In this article, IEEE Software’s Editor in Chief presents her view on the topic. The main point is that this engineering is both similar and different. This quote from the paper summarizes it nicely: “I argue that our existing design techniques will not only help us make progress in understanding how to design, deploy, and sustain the structure and behavior of AI-enabled systems, but they are also essential starting points. I suggest that what is different in AI engineering is, in essence, the quality attributes for which we need to design and analyze, not necessarily the design and engineering techniques we rely on. “

One of the differences is the development process. It is not aligned with that of non-ML systems, e.g. in terms of training, testing and maintenance. ML systems are data-centric, and this needs to be reflected in the AI engineering processes.

Ipek Ozkaya discusses the following misconceptions about the differences:

  • We can specify systems – both AI and non-AI systems cannot really be fully specified,
  • System correctness can be verified – we can never fully verify systems, neither AI-based nor non-AI-based (e.g. due to complexity),
  • We can avoid hidden dependencies,
  • We can manage system change propagation,
  • Frameworks do it all,
  • We can build reliable systems from unreliable and unpredictable subcomponents

I recommend this article to get a quick overview of the gist of the differences and misconceptions.

Problems with engineering AI systems


Engineering machine learning systems is much more than train-evaluate cycles. We need to systematically integrate these ML components with the rest of the system. We need to build safety cages to ensure that the decisions are not out of bounds, and we need to make sure that we can maintain these systems.
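
A safety cage can be as simple as a wrapper that checks whether the model’s input and output stay within known bounds and falls back to a safe default otherwise. The sketch below is my own minimal illustration of that idea, not a mechanism from the paper; the bounds, the fake model and the fallback value are placeholders.

```python
# A minimal safety-cage sketch: reject out-of-bounds inputs/outputs (illustrative).
from dataclasses import dataclass
from typing import Callable

@dataclass
class SafetyCage:
    model: Callable[[float], float]    # the wrapped ML model
    input_range: tuple[float, float]   # bounds seen during training (placeholder)
    output_range: tuple[float, float]  # acceptable actuation range (placeholder)
    fallback: float                    # safe default action

    def predict(self, x: float) -> float:
        lo, hi = self.input_range
        if not (lo <= x <= hi):              # input outside the trained region
            return self.fallback
        y = self.model(x)
        lo, hi = self.output_range
        if not (lo <= y <= hi):              # output would be an unsafe decision
            return self.fallback
        return y

# Example: a fake steering model wrapped in the cage (all numbers are made up).
cage = SafetyCage(model=lambda x: 0.1 * x, input_range=(-50, 50),
                  output_range=(-0.5, 0.5), fallback=0.0)
print(cage.predict(10.0))    # within bounds -> model decision
print(cage.predict(500.0))   # out of bounds -> safe default
```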

In this paper, the authors studied the example of automated driving vehicles – not fully autonomous, but still – and showed the challenges that we need to solve before AI and ML become one of our “fellow drivers” on the roads.

The findings of the paper show that it’s not going to happen soon. As the authors say in the abstract: “Our results show that machine learning models are characterized by a lack of requirements specification, lack of design specification, lack of interpretability, and lack of robustness. We also perform a gap analysis on a conventional system quality standard SQuaRE with the characteristics of machine learning models to study quality models for machine learning systems. We find that a lack of requirements specification and lack of robustness have the greatest impact on conventional quality models. “

The authors provide a process for using machine learning models as part of safety-critical software, where the design of the system and its real-scenario validation are further apart than they traditionally are.

What I really like about the paper is the gap analysis between ML systems and the ISO 25000 quality model, which the authors summarise in a table.

Actionable software metrics – an interesting new article


Working with metrics is a domain which calls for empirical data, and that data constantly changes. Software companies evolve and their metrics programs evolve with them. I’ve always been interested in how metrics data is used in companies, especially in geographical regions other than the Nordics. Although there are differences between companies in Sweden, Denmark or Finland, these companies are still more similar to each other than to companies within the same domain in other parts of the world – and that’s perfectly fine.

In this paper, the authors captured my attention because they studied a few companies that were not on my radar before. The authors have also found an interesting angle on the metrics work – what makes a metric actionable?

As it turns out, to be actionable a metric should:

  • be practical,
  • inform decision-making, and
  • exhibit data quality.

These three characteristics are very important and were agreed upon by most of the respondents. The authors also recognise the need for the metrics to be temporal, i.e. relevant for the information needs at hand.

What I also liked about the paper is that they provide the link to their data, which is the set of metrics used by the studied companies – a very interesting list:

However, what is interesting, and contradicts what we have observed in our own work, is that metrics focused on a specific project/product, or fully universal ones, were not that popular. Instead, the metrics should be applicable to multiple products/projects, i.e. the type of the measured entity should contain more than one instance.

So, are the non-actionable metrics a complete waste of time then? Well, I would not say so, and neither do the authors. Non-actionable metrics can still be informative. They can be used to raise awareness of an issue, or simply provide the means for monitoring the situation or the product, without the need to trigger specific decisions. Examples: product sales numbers, customer satisfaction, etc. They are hard to act on directly, but very important to collect and monitor.

What to discuss about Deep Learning? – an EMSE article review


A study about what developers discuss regarding deep learning, and whether this differs across frameworks, is an interesting summer discussion topic:

The paper reviews the comments of developers who comment on and/or post questions about three deep learning frameworks: Theano, TensorFlow and PyTorch. I got interested in the paper because I wanted to see whether the communities using these frameworks differ. I was introduced to TensorFlow a while back and have kept using it. Since I’m not an ML researcher, the framework does not really matter to me, but I would still like to know whether I should read up on some new framework during the summer.

The observations quoted from the abstract:

1) a wide range of topics that are discussed about the three deep learning frameworks on both platforms, and the most popular workflow stages are Model Training and Preliminary Preparation.

2) the topic distributions at the workflow level and topic category level on Tensorflow and PyTorch are always similar while the topic distribution pattern on Theano is quite different. In addition, the topic trends at the workflow level and topic category level of the three deep learning frameworks are quite different.

3) the topics at the workflow level show different trends across the two platforms. e.g., the trend of the Preliminary Preparation stage topic on Stack Overflow comes to be relatively stable after 2016, while the trend of it on GitHub shows a stronger upward trend after 2016.

It’s interesting that the topics are roughly the same, but I’m a bit surprised that the topics are mostly about the data management/machine learning and not the frameworks themselves. This means that applications win over development of the frameworks – at least at the moment.
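
For readers curious how such topics are typically extracted, the sketch below runs a small LDA model over a handful of made-up developer posts. The study’s actual corpus, preprocessing and number of topics are of course different – this only illustrates the technique.

```python
# A minimal topic-modelling sketch (LDA over made-up developer posts); the study's
# actual corpus and pipeline differ - this only illustrates the technique.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "error when loading training data into tensorflow dataset",
    "how to prepare and normalise images before training",
    "model training loss does not decrease with pytorch",
    "cuda out of memory during model training",
    "how to export and deploy a trained model",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top_terms)}")
```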

Testing ML applications



I’ve recently looked at applications of different testing techniques for ML applications and got interested in so-called metamorphic testing. The idea is that we specify how the output should change (or remain within a specific range or set) when the input is transformed in a known way – such an expected relation is called a metamorphic relation.
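
To make this concrete, a classic metamorphic relation for a distance-based classifier is that shifting all features of both the training data and the query by the same constant preserves the distances, and therefore should not change the predictions. The sketch below is my own toy illustration of such a relation, not the framework from the paper.

```python
# A tiny metamorphic test (illustrative): for a Euclidean kNN classifier,
# shifting ALL features of both the training data and the query by the same
# constant preserves pairwise distances, so predictions should not change.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
shift = 3.7  # arbitrary constant used by the metamorphic relation

source_model = KNeighborsClassifier(n_neighbors=5).fit(X, y)
follow_up_model = KNeighborsClassifier(n_neighbors=5).fit(X + shift, y)

source_pred = source_model.predict(X)
follow_up_pred = follow_up_model.predict(X + shift)

# Any violation (ideally zero) would indicate that the implementation is
# sensitive to the data representation rather than to the data itself.
violations = int((source_pred != follow_up_pred).sum())
print(f"metamorphic relation violated for {violations} of {len(X)} inputs")
```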

What is interesting about this paper is that it presents a framework for testing ML applications. I’ve not tried it yet, but I will, as it seems very interesting to see how things work with metamorphic testing and metamorphic relations. I’m also interested in how to measure the quality of the software in this context.

Succeeding with large-scale measurement programs

This week we had the possibility to give a webinar about how to work with large-scale measurement programs. The webinar was dedicated to everyone who works with software metrics and would like to get more impact from that work.

It is not so much about the numbers; it is about the impact and what the numbers mean. The webinar provides a good understanding of how to make this impact. Based on our experiences, we have distilled everything one needs to know to implement a measurement program in a few weeks rather than years.

The webinar has been recorded and is available at this link:

Recording from the webinar about how to succeed with measurement programs.

Engineers and scientists love to measure. We measure the complexity of software, its performance, size and maintainability (just to name a few). We need these measurements in order to construct software, manage organizations and release high-quality, highly reliable products. However, there is a difference between measuring software aspects and using the measures in decision processes. In this talk, we present the concepts of measurement programs, measurement systems, information quality and indicator-triggered decisions. We show what to consider when setting up measurement programs and provide hints about the costs and benefits of having the program. We end the talk by presenting recent research results from Software Center, where we combine measurements and machine learning to speed up software development.
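
As a small illustration of what indicator-triggered decisions mean in practice, the sketch below turns a raw measure into a traffic-light indicator with an analysis model and an associated decision for each status. The measure, thresholds and decision texts are made up; in a real measurement program they come from the stakeholders’ information needs.

```python
# Illustrative indicator with an analysis model and indicator-triggered decisions.
# The measure, thresholds and decision texts are placeholders, not from the webinar.
from dataclasses import dataclass

@dataclass
class Indicator:
    name: str
    green_max: float   # analysis model: value <= green_max  -> green
    yellow_max: float  #                 value <= yellow_max -> yellow, else red

    def evaluate(self, value: float) -> str:
        if value <= self.green_max:
            return "green"
        return "yellow" if value <= self.yellow_max else "red"

DECISIONS = {            # the decision triggered by each indicator status
    "green": "no action needed",
    "yellow": "review the trend at the next team meeting",
    "red": "stop feature work and plan corrective actions",
}

defect_backlog = Indicator("open post-release defects", green_max=10, yellow_max=25)
value = 27.0  # current value of the base measure (made-up)
status = defect_backlog.evaluate(value)
print(f"{defect_backlog.name}: {value} -> {status} -> {DECISIONS[status]}")
```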

More materials about this are available here:

A while back we gave a webinar with a similar title, where we focused on the questions concerning the measurement infrastructure, visualization and assessment of the measurement program. The ACM webinar is presented here:

Classifying code smells…


Code smells are quite interesting phenomena to study. They are not really defects, but they are not good code either. They exist, but people rarely want to admit to them. There is also no consensus on how much effort it takes to remove them (or even whether they should be removed or just avoided).

In this paper, the authors study whether it is possible to use ML to find code smells. It turns out it is possible, and the accuracy is quite high (over 95%). The paper also shows that sometimes it is better to show a number of recommendations (e.g. two potential smells) rather than one – it requires less accuracy from the classifier, but helps the users narrow down their solution space.
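
As an illustration of the “two recommendations instead of one” point, here is a sketch in which a classifier trained on code metrics returns its two most probable smell classes for a new sample. The features, smell labels and data are synthetic placeholders, not the study’s setup.

```python
# Illustrative only: top-2 smell recommendations from a metrics-based classifier.
# Features, smell labels and data are synthetic; the paper's setup differs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
smells = np.array(["god_class", "long_method", "feature_envy", "none"])

# Synthetic "code metrics" (e.g. LOC, complexity, coupling) and random smell labels.
X = rng.normal(size=(200, 3))
y = rng.integers(0, len(smells), size=200)

clf = RandomForestClassifier(random_state=0).fit(X, y)

new_sample = rng.normal(size=(1, 3))
proba = clf.predict_proba(new_sample)[0]
top2 = np.argsort(proba)[-2:][::-1]   # indices of the two most likely classes
print("recommendations:",
      [(smells[clf.classes_[i]], round(float(proba[i]), 2)) for i in top2])
```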