How do we know if something is popular…

Investigating diversity and impact of the popularity metrics for ranking software packages (review): https://onlinelibrary-wiley-com.ezproxy.ub.gu.se/doi/pdfdirect/10.1002/smr.2265

Image from Pixabay

I’ve written about ways of assessing how good software is. One of the modern approaches, which I talked about before, is the use of A/B testing and online experiments. Providing users with different versions of features/systems/use cases allows a company to understand which of the options gets the best response from the users.

However, there are a number of challenges with this approach – the most prominent being potential confounding factors. Even if the results show a positive/negative response, we do not really know whether the response was caused by something else (for example, users being tired, changes in the environment, etc.).

Having used GitHub, both as a user and as a researcher, I have sometimes wondered whether the star system is actually the right one, and whether we should instead use a sort of A/B testing system where we could check how often people actually access certain repositories.

In this paper, the authors take a look at different ways of assessing the popularity of repositories. The results show that, regardless of the metric, the popular repositories are popular – i.e. popularity does not depend on the choice of metric.

Popularity metrics studied:

  • Total number of downloads of the package
  • Number of projects dependent on the package
  • Number of repositories dependent on the package
  • Source rank of the package
  • Number of forks
  • Number of watchers
  • Number of contributors
  • Number of stars
  • Number of open issues
  • Total number of tags

The actual analysis is quite interesting, so I recommend taking a look at the paper directly.
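
If you want to play with a few of the metrics listed above yourself, here is a minimal sketch of pulling some of them from the GitHub REST API (my own illustration, not the paper’s tooling; the repository in the example is arbitrary and error handling/authentication are omitted):

```python
# Minimal sketch: fetching a few of the popularity metrics above from the
# GitHub REST API (stars, forks, watchers, open issues).
import requests

def fetch_popularity_metrics(owner: str, repo: str) -> dict:
    response = requests.get(f"https://api.github.com/repos/{owner}/{repo}")
    response.raise_for_status()
    data = response.json()
    return {
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "watchers": data["subscribers_count"],   # "watchers_count" mirrors the star count
        "open_issues": data["open_issues_count"],
    }

if __name__ == "__main__":
    print(fetch_popularity_metrics("numpy", "numpy"))
```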

Using machine learning to understand the quality of requirements

Image by Hans Braxmeier from Pixabay

https://link-springer-com.ezproxy.ub.gu.se/article/10.1007%2Fs11219-020-09511-4

Working with software requirements and metrics is an important part of research in modern software companies. Although many companies are Agile or post-Agile and claim that they do not have requirements, they still capture user needs in textual form – for example as user stories, epics, or use cases.

This paper is an interesting take on software requirements quality assessment. Instead of just calculating metrics and creating quality models, the authors use machine learning to mimic the way experts judge what is a good requirement and what is not. They use several quality functions to distinguish between good and bad requirements. Using multiple functions in a multidimensional space makes it possible to select groups of requirements that are separated by the other class – the figures in the paper show how this works in practice.

The gist of the paper is actually best presented in the introduction (quote): “Summing up, we can compute a set of quantitative metrics of textual requirements, and through them, we can assess the quality of requirements. However, the risk of this approach is to build assessment methods and tools that are both arbitrary in the parameterization of metrics and rigid in the combination of metrics to evaluate the different properties. This is why we propose in this work to develop a flexible assessment method that can be adapted to different contexts, with a high degree of automation. The method consists basically in the emulation of the experts’ judgment on quality through artificial intelligence techniques: first, obtain the expert’s implicit quality function through machine learning, and, second, apply this function to automatically assess the quality of textual requirements.

Our approach to emulate the experts’ judgment, as explained later in detail, is based on well-known machine learning techniques: we have a computer tool learn from a previous human-made classification of requirements according to their quality. Therefore, our work’s intent is not to improve machine learning techniques, but rather to devise a novel application to the field of requirements quality assessment.”
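
To make the recipe concrete, here is a minimal sketch of the general idea (my own illustration, not the authors’ implementation): compute a few simple quantitative metrics of requirement texts and let a standard classifier learn an expert’s good/bad labels. The metrics, word list and toy requirements below are all made up.

```python
# Minimal sketch of the general recipe (not the authors' tool): compute
# simple metrics of requirement texts and learn an expert's labels.
from sklearn.ensemble import RandomForestClassifier

WEAK_WORDS = {"should", "could", "maybe", "etc", "appropriate", "fast"}

def metrics(requirement: str) -> list[float]:
    words = requirement.lower().split()
    return [
        len(words),                                       # length of the requirement
        sum(w.strip(".,") in WEAK_WORDS for w in words),  # ambiguous/weak words
        requirement.count(","),                           # crude complexity proxy
    ]

# Toy data: requirement texts labelled by an expert (1 = good, 0 = bad).
texts = [
    "The system shall respond to a search query within 2 seconds.",
    "The system should be appropriately fast, etc.",
    "The user shall be able to export the report as PDF.",
    "Maybe the tool could support some other formats.",
]
labels = [1, 0, 1, 0]

clf = RandomForestClassifier(random_state=0).fit([metrics(t) for t in texts], labels)
print(clf.predict([metrics("The system shall log every failed login attempt.")]))
```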

I strongly recommend reading the paper, as it provides very good methods for working with requirements quality in many modern organisations.

Developing recommender systems – a framework which may just be the one (for us)…

https://rdcu.be/b3wgZ

Image by Gerd Altmann from Pixabay 

Creating recommendation systems is a tricky task. We need to add the temporal domain to the data. In particular, we need to make sure that we capture what was recommended to a specific user before and how the user reacted to it. We also need to capture the evolution of the users and the data.

In this paper, the authors present a framework, RectoLibry, which helps to construct this kind of system. The framework captures both the development of the recommendations and their deployment.

The system is based on designing an ontology (yes, my good old friend, used since before Web 2.0, even in my own research: https://link.springer.com/chapter/10.1007/978-3-540-87875-9_60 , https://link.springer.com/chapter/10.1007/3-540-46102-7_20 ).

The ontology describes the relationships that exist in the recommendation domain and provides the support for the selections and the feedback loops.
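
As a tiny illustration of the idea (mine, not RectoLibry itself): the “ontology” below is just a set of explicit relations between users, items and topics, and the same structure drives both the selection of recommendations and the recording of feedback.

```python
# Tiny illustration (not RectoLibry): explicit relations between users,
# items and topics drive both recommendation selection and the feedback loop.
from collections import defaultdict

relations = defaultdict(set)        # (subject, predicate) -> set of objects

def add(subject, predicate, obj):
    relations[(subject, predicate)].add(obj)

# Domain knowledge and past interactions.
add("book:clean_code", "hasTopic", "topic:software_quality")
add("book:refactoring", "hasTopic", "topic:software_quality")
add("user:alice", "liked", "book:clean_code")

def recommend(user):
    liked = relations[(user, "liked")]
    topics = {t for item in liked for t in relations[(item, "hasTopic")]}
    # Recommend items sharing a topic with something the user liked.
    return {item for (item, pred), objs in relations.items()
            if pred == "hasTopic" and objs & topics} - liked

def feedback(user, item, reaction):
    add(user, reaction, item)       # closes the loop: future recommendations see it

print(recommend("user:alice"))      # -> {'book:refactoring'}
feedback("user:alice", "book:refactoring", "liked")
```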

I recommend taking a look at the paper and the framework if you want to build a recommendation system. I will, when looking at the assignments from the software measurement PhD course.

Generating comments in code – can a machine make our code more readable?

Image by Markus Spiske from Pixabay 

Our research team has been working with code comments for a while now. We have analysed source code comments and the code they describe, and we have also studied the needs for code comments.

The results showed that AI support for the code review process is very much needed. However, they also showed that the current tools are not good enough yet. One of the tools is DeepCode.ai, which analyzes a code repository and finds problems in the code. It is a great tool, but it has been trained on large data sets from open source projects, which makes it tricky to use on proprietary software – it does not know how to capture project-specific characteristics.

I recently came across this article (https://link-springer-com.ezproxy.ub.gu.se/content/pdf/10.1007/s10664-019-09730-9.pdf), which is about generating code comments. The presented tool generates code comments (not review comments) for the Java programming language. It “understands” what the code does and generates a natural language description of it. The tool is based on the latest work from the NLP domain and has been trained on over 44 million statements, which is quite a number😊

The tool is not perfect (yet), but it shows improvement over existing approaches and definitely opens up new avenues for using NLP in Software Engineering.
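
The model in the paper is its own, but for a rough flavour of what neural code summarization looks like with off-the-shelf components, something like the sketch below can be tried. The checkpoint name is my assumption of a publicly available code-summarization model, not the paper’s tool.

```python
# Rough flavour of neural code-to-comment generation (not the paper's tool).
# The checkpoint below is an assumed example of a publicly available code
# summarization model; swap in whatever model you have access to.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5-base-multi-sum"   # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

java_method = """
public int max(int a, int b) {
    return a > b ? a : b;
}
"""

inputs = tokenizer(java_method, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_length=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```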

Let’s stay tuned, more will come!

Quality of Deep Learning code – article review

A (deep) Staircase in Vatican, Image by JEROME CLARYSSE from Pixabay

http://swat.polymtl.ca/~foutsekh/docs/hadhemi-MSR2020.pdf

Deep learning models are often designed, trained and tested in Python. It is a language with a nice structure, quite straightforward syntax and a lot of libraries. However, very few tutorials about deep learning (or about Python programming in general) discuss the quality of the code, e.g. its modularization, encapsulation or naming consistency.

As a result, a lot of machine learning code written in Python is hard to read and hard to grasp. Even when used as part of Jupyter notebooks, the code is often not really commented.

The study behind the link above supports my long-standing gut feeling about this. The findings show that (from the abstract): “First, long lambda expression, long ternary conditional expression, and complex container comprehension smells are frequently found in deep learning projects. That is, deep learning code involves more complex or longer expressions than the traditional code does. Second, the number of code smells increases across the releases of deep learning applications. Third, we found that there is a co-existence between code smells and software bugs in the studied deep learning code, which confirms our conjecture on the degraded code quality of deep learning applications.”
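
To picture the smells named in the abstract, here are two made-up examples with simpler alternatives:

```python
# Illustrative (made-up) examples of two of the smells named above.

# Long lambda expression:
score = lambda m: (m["tp"] + m["tn"]) / (m["tp"] + m["tn"] + m["fp"] + m["fn"]) if (m["tp"] + m["tn"] + m["fp"] + m["fn"]) > 0 else 0.0

# Better: a named function with intermediate variables.
def accuracy(m):
    total = m["tp"] + m["tn"] + m["fp"] + m["fn"]
    return (m["tp"] + m["tn"]) / total if total else 0.0

# Complex container comprehension:
pairs = [(x, y) for x in range(10) for y in range(10) if x != y and (x + y) % 2 == 0 and x * y > 5]
# Better: an explicit loop, or a comprehension split into well-named steps.
```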

The second finding, about the constant increase in the number of code smells, is similar to the studies we did on complexity in proprietary software – the complexity “never” decreases ( http://web.student.chalmers.se/~vard/files/Monitoring%20Complexity%20Evolution.pdf ).

The study compares 59 deep learning systems with 59 non-ML systems from GitHub. One could argue that the sample is not representative (no proprietary systems), but it is a fair sample.

To sum up, a very nice read, showing that we need to think not only about model quality, but also about code quality.

How bugs are born: a model to identify how bugs are introduced in software components (review)

https://link-springer-com.ezproxy.ub.gu.se/content/pdf/10.1007/s10664-019-09781-y.pdf

Image by GLady from Pixabay 

I came across this article from Empirical Software Engineering and it caught my attention. It describes a study of how to identify where a bug was introduced.

The article accurately observes that defects are most often fixed in a place where they were NOT introduced. So, the question is whether we can find where the defects were actually introduced.

Several studies have focused on understanding which release/commit introduced a specific defect. This article describes how to find that particular release. It is based on a theoretical framework of perfect tests, i.e. tests which can capture defects in the releases where they were introduced. The authors evaluate four different algorithms on two different open source projects. Their findings show that it is possible, to some extent, to find the right release where a bug was introduced. Knowing the release, and knowing which changes were introduced in that release, it is possible to narrow down the piece of code that contains the bug.
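
The algorithms evaluated in the paper are their own, but the classic way of approximating bug-introducing changes is the SZZ idea: blame, in the parent revision, the lines touched by the bug-fixing commit. A rough sketch, assuming a local git repository and hypothetical commit/path/line inputs:

```python
# Rough SZZ-style sketch (not the paper's algorithms): for a bug-fixing
# commit, blame the changed lines in the parent revision to collect
# candidate bug-introducing commits. Assumes a local git repository.
import subprocess

def candidate_bug_introducers(repo: str, fix_commit: str, path: str, lines: list[int]) -> set[str]:
    candidates = set()
    for line in lines:
        blame = subprocess.run(
            ["git", "-C", repo, "blame", "-L", f"{line},{line}",
             f"{fix_commit}^", "--", path],
            capture_output=True, text=True, check=True,
        )
        candidates.add(blame.stdout.split()[0])   # first token is the commit hash
    return candidates

# Example call (hypothetical commit, path and line numbers):
# print(candidate_bug_introducers(".", "abc1234", "src/module.py", [42, 57]))
```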

Very interesting work, and I am looking forward to more studies in this area, in particular on proprietary software!

Finding many needles in one haystack?

Image by S. Hermann & F. Richter from Pixabay

Multiple fault localization: https://www.sciencedirect.com/science/article/pii/S0950584920300641?dgcid=coauthor

A lot of defect research focuses on either the localization of defects or the prediction of whether a defect will be found/fixed, etc. I am guilty of adding to the state of the art in this area with a number of articles. It is a great line of work, nice because we can play with data and get results that can actually be verified – we can check whether a defect is or is not there.

However, in many cases, the defect can be a mistake made in a number of places – so-called multiple faults. Therefore, this article, fresh from the press at IST, caught my attention. It presents a systematic review of what has been done in that area.

It turns out, not that much, but the field has been gaining popularity in the past few years.

What I like in particular about this paper is that it asks which datasets exist (see Table 8 in the paper for the full reference). I can’t wait to take a closer look at these datasets – maybe something for my PhD course in metrics next year?
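
The review covers many techniques; as a flavour of the building blocks in this line of work (not specific to this paper), here is the Ochiai suspiciousness score commonly used in spectrum-based fault localization, computed per program element from passing/failing test coverage:

```python
# Spectrum-based fault localization building block (not from the reviewed
# paper): Ochiai suspiciousness per program element.
from math import sqrt

def ochiai(failed_cover: int, passed_cover: int, total_failed: int) -> float:
    # failed_cover: failing tests that execute the element
    # passed_cover: passing tests that execute the element
    # total_failed: all failing tests in the test suite
    denom = sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denom if denom else 0.0

# Toy example: element covered by 3 of 4 failing tests and 1 passing test.
print(round(ochiai(failed_cover=3, passed_cover=1, total_failed=4), 3))  # -> 0.75
```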

Machine learning testing… (review)

https://ieeexplore.ieee.org/abstract/document/9000651

Image by DarkWorkX from Pixabay

Testing machine learning systems is a tricky business. Not only are the algorithms based on statistics, they are also very complex and highly dependent on the data used for training and validation. Yet these algorithms are very important for our modern software systems, and therefore we need to make sure that they work as intended.

I came across an article in which the authors reviewed the literature on how machine learning systems are tested. The aspects that the paper looks into are:

What to test:

  • Test input generation
  • Test oracle generation
  • Test adequacy evaluation
  • Bug report analysis
  • Debug and repair

Where to test:

  • Data testing
  • Learning program testing
  • Framework testing

Test for what:

  • Correctness
  • Model Relevance
  • Robustness & Security
  • Efficiency
  • Fairness
  • Interpretability
  • Privacy

The list is quite impressive, and so is the paper. For me, the most interesting category was the testing of data, which reviews challenges and also provides some solutions. For example, it lists frameworks that are used for testing data, such as ActiveClean or BoostClean. These frameworks look at the data and try to capture how valuable the data is for the actual algorithm.
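
To give a flavour of what “data testing” can mean in practice, here is a small hand-rolled sketch (the frameworks mentioned above go much further); the column names and expected ranges are made up:

```python
# Hand-rolled flavour of "data testing": simple checks on schema, missing
# values and value ranges before data reaches the learning program.
import pandas as pd

def check_training_data(df: pd.DataFrame) -> list[str]:
    problems = []
    expected = {"age": "int64", "income": "float64", "label": "int64"}
    for column, dtype in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"wrong dtype for {column}: {df[column].dtype}")
    if df.isna().any().any():
        problems.append("missing values present")
    if "age" in df.columns and not df["age"].between(0, 120).all():
        problems.append("age outside [0, 120]")
    return problems

df = pd.DataFrame({"age": [25, 40, 130],
                   "income": [30000.0, 52000.0, 41000.0],
                   "label": [0, 1, 0]})
print(check_training_data(df))   # -> ['age outside [0, 120]']
```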

Stronger features vs. stronger algorithms in ML

I’ve been working with machine learning a bit during the last couple of years. I’ve had great teachers who showed me how to use the algorithms and where to start learning. Thanks to them I understood the importance of different elements of the ML tool chain – data, storage, algorithms, hardware.

I’ve worked on the problem of how to extract features from source code so that I can use them to predict whether a specific line of code has a defect or not, and in particular whether the defect can be caught during code reviews. I spent about a year on this problem and tested all kinds of combinations, from static code analysis to word embeddings, dictionaries and other NLP mechanisms for understanding the code. Nothing really worked well; I got predictions that were only a bit better than chance.

What was the problem? Well, the problem was the quality of the input data. Since I extracted the data, and the features from that data, automatically from large code bases (often over 3 MLOC), I ran into the following problems:

  • Labeling – I could not pinpoint exactly where the problem was, which meant that I needed to approximate the label, which led to the next problem,
  • Consistency – a line that one person considered good could be considered problematic by another one; this meant that I needed to decide how to treat lines that are “suspicious”, and
  • Scales – when extracting features, some of them were on a scale of 1 to 100, whereas others were on a scale from 1 to 3; this meant that I needed a good scaler to get the features right.

So, here I am, working on the next implementation of the feature discovery algorithm: an algorithm that can extract features in such a way that each object has distinct characteristics, yet the number of features is as small as possible while still characterizing each object. The algorithm helped me boost the accuracy of the classification from ca. 50% to over 96%.

I’ve discovered that using simple ML algorithms on a good data set trumps everything else. I used AdaBoost with feature scaling on the good data set, and it was at least twice as good as using LSTM models with word embeddings (which were not bad anyway) for the same purpose.
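
A sketch of that kind of setup: feature scaling followed by a simple ensemble classifier in a scikit-learn pipeline. The data is random just to keep the snippet self-contained, so it is not my actual data or results.

```python
# Sketch of the "simple algorithm on well-prepared features" setup:
# feature scaling followed by AdaBoost in a scikit-learn pipeline.
# Random data stands in for real features extracted from code.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # placeholder feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # placeholder labels

model = make_pipeline(MinMaxScaler(), AdaBoostClassifier(random_state=0))
print(cross_val_score(model, X, y, cv=5).mean())
```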

My advice, therefore, is the following:

  • Start with a simple classification/ML algorithm and do not go straight to neural networks or other advanced methods,
  • Learn your data and look at it from several angles; use business intelligence and statistics to understand the dependencies between features (PCA, t-SNE – see the sketch after this list) and chew on the data as long as you can, and
  • Focus on extracting features from your data rather than expecting magic from ML; no algorithm can trump good input data and no filtering can trump a good “featurizer”.
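
For the second advice point, here is a minimal sketch of inspecting feature dependencies with PCA (t-SNE is used analogously via sklearn.manifold.TSNE); again the data is random just to keep the snippet runnable.

```python
# Minimal sketch for the second advice point: inspect how much structure
# the features carry using PCA. Random data keeps the snippet self-contained.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # placeholder feature matrix

pca = PCA(n_components=2).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
projected = pca.transform(X)             # 2-D view for plotting / inspection
print(projected.shape)                   # (200, 2)
```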

Image source: pixabay

Grit to Great or what we can learn from perseverance

I picked up this book to learn a bit about perseverance and the power of pursuing goals. I hoped to see if there was something I could learn from it for my new year’s resolutions.

It turned out to be a great book about being humble and about getting rejected. Let me explain. The concept of grit combines the guts to do something, the resilience to get rejected, the initiative to start working on the next steps regardless of the outcome, and, finally, the tenacity to stay focused on the goals.

The last one is important for new year’s resolutions, but resilience is an interesting quality too. One can go on autopilot for the mundane things, but still needs resilience when things go wrong. Sounds a bit like academic careers: we plan studies, conduct them, try to publish, get rejected, improve the papers, try to publish again, and so on.

We also need initiative to move our field of study forward. We need to come up with new project ideas and submit research proposals. Get rejected. Fine-tune the proposals, resubmit somewhere else, and so on.

Finally, guts is a big quality. Researchers need to have the guts to take on big problems, to plan and conduct studies, and to speak in front of large audiences. Yes, speaking is not something that comes easily to most of us. We still need to prepare and figure out what we want to say and how. We need to adjust our talks to the audience, the message and the goal of the talk.

It’s a great book to get some motivation for the work after the vacations. Work hard, publish, apply for funding and work even harder. Amidst all of that, please remember that you need to have students with you and that they need your attention too!