This week we had the opportunity to give a webinar on how to work with large-scale measurement programs. The webinar was intended for everyone who works with software metrics and would like to get more impact from that work.
It is not so much about the numbers; it is about the impact and what the numbers mean. The webinar provides a good understanding of how to achieve this impact. Based on our experience, we selected what one needs to know to implement a measurement program in a few weeks rather than years.
The webinar has been recorded and is available at this link: https://www.youtube.com/watch?v=2ChaVT_3djE&feature=youtu.be
Engineers and scientists love to measure. We measure the complexity of software, its performance, size and maintainability (just to name a few). We need these measurements in order to construct software, manage organizations or release high-quality, highly reliable products. However, there is a difference between measuring software aspects and using the measures in decision processes. In this talk, we present the concepts of measurement program, measurement system, information quality and indicator-triggered decisions. We show what to consider when setting up measurement programs and provide hints about the costs and benefits of having such a program. We end the talk by presenting recent research results from Software Center, where we combine measurements and machine learning to speed up software development.
A while back we gave a webinar with a similar title, where we focused on the questions concerning the measurement infrastructure, visualization and assessment of the measurement program. The ACM webinar is presented here:
Code smells are quite interesting phenomena to study. They are not really defects, but they are not good code either. They exist, but people rarely want to admit to them. There is also no consensus on how much effort it takes to remove them (or even whether they should be removed or just avoided).
In this paper, the authors study whether it is possible to use ML to find code smells. It turns out that it is possible, and the accuracy is quite high (over 95%). The paper also shows that it is sometimes better to show a number of recommendations (e.g. two potential smells) rather than one: this requires less accuracy from the recommendation, but helps the users to narrow down their solution space.
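To illustrate why recommending two candidate smells is easier than recommending one, here is a minimal sketch of top-k accuracy. The smell names and the classifier scores are made up for illustration; they are not taken from the paper.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scored labels."""
    hits = 0
    for sample_scores, truth in zip(scores, true_labels):
        # Order the candidate labels from the highest to the lowest score.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Hypothetical classifier scores for four code fragments.
scores = [
    {"long_method": 0.60, "god_class": 0.30, "feature_envy": 0.10},
    {"long_method": 0.45, "god_class": 0.40, "feature_envy": 0.15},
    {"long_method": 0.20, "god_class": 0.30, "feature_envy": 0.50},
    {"long_method": 0.50, "god_class": 0.10, "feature_envy": 0.40},
]
true_labels = ["long_method", "god_class", "feature_envy", "feature_envy"]

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
print(f"top-1 accuracy: {top1:.2f}, top-2 accuracy: {top2:.2f}")
```

In this toy data the classifier is right only half of the time with a single recommendation, but the true smell is always among its two best guesses.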
Data veracity is the degree to which data corresponds to the true values. It comes from the metrological concept of “measurement trueness”, which is the degree to which the measurement quantifies the value correctly.
Well, that sounds very simple, but it is in fact quite complex. In our previous work, we scrutinized what it means to have veracious data in transport systems (https://ieeexplore.ieee.org/abstract/document/7535482). It turns out that “lying” is not the only option here.
In this book, the author looks into the ways in which things can be untrue: sometimes deliberately, by lying, and sometimes by mistake. Sometimes, as we learn in the last chapter (with the Brazilian aardvark), a mistake can actually end up being accepted as truth over time.
I recommend the book as it is written in a fantastic manner, providing examples from the real world (e.g. the alleged drone sightings over Gatwick in 2018). It even goes a bit further and discusses the need for replication of studies, arguing that we should get more funding for making scientific results more solid and robust.
This is a great paper demonstrating the use of NLP techniques for completion of software source code. It uses recurrent networks and can reduce the size of the vocabulary compared to previous approaches.
As the authors say: “The CodeGRU introduces a novel approach which can correctly capture the source code context by leveraging the token type information.”
I like the approach because it can extract the information that is important for the analysis of source code – what kind of token is analysed and how it is used.
Conclusions (quote from the abstract): “Our experiment confirms that the source code’s contextual information can be vital and can help improve the software language models. The extensive evaluation of CodeGRU shows that it outperforms the state-of-the-art models. The results further suggest that the proposed approach can help reduce the vocabulary size and is of practical use for software developers.”
I’m kind of keen to check this approach in our work, to see if we can use it to improve the quality of source code.
I’ve written about the ways of assessing how good software is. One of the modern approaches, which I talked about before, is the use of A/B testing and online experiments. Providing the users with different versions of the features/systems/use cases allows the company to understand which of the options provides the best response from the users.
However, there are a number of challenges with this approach, the most prominent being the potential existence of confounding factors. Even if the results show a positive or negative response, we do not really know whether that response is caused by something else (for example users being tired, changes in the environment, etc.).
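As a minimal sketch of how the response to two variants is typically compared, here is a two-proportion z-test using only the standard library. The conversion counts are invented, and the test only tells us whether the observed difference is larger than chance would explain; it says nothing about confounding factors, which have to be handled by the experiment design (e.g. random assignment of users).

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 1000 users per variant.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```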
After using GitHub, both as a user and as a researcher, I sometimes wondered whether the star system is actually the right one. I wondered whether we should use a sort-of A/B testing system where we could check how often people usually access certain repositories.
In this paper, the authors take a look at different ways of assessing the popularity of repositories. The results show that, regardless of the metric, the popular repositories are popular, i.e. popularity does not depend on the metric used.
Popularity metrics studied:
Total number of downloads of the package
Number of projects dependent on the package
Number of repositories dependent on the package
Source rank of the package
Number of forks
Number of watchers
Number of contributors
Number of stars
Number of open issues
Total number of tags
The actual analysis is quite interesting, so I recommend taking a look at the paper directly.
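One simple way to check whether two popularity metrics agree is to compare how they rank the same repositories. Below is a minimal sketch using Spearman's rank correlation, written in plain Python. The star and fork counts are invented, and I am not claiming that this is the exact analysis used in the paper.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical numbers for five repositories.
stars = [1200, 300, 50, 8000, 640]
forks = [150, 40, 45, 900, 80]

rho = spearman(stars, forks)
print(f"Spearman correlation between stars and forks: {rho:.2f}")
```

A value close to 1 means that the two metrics order the repositories in almost the same way, which is what "popular repositories are popular regardless of the metric" would look like in the data.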
Working with software requirements and metrics is an important part of research in modern software companies. Although many of the companies are Agile or post-Agile, claiming that they do not have requirements, they still capture user needs in textual form. For example, they describe user stories, epics and use cases.
This paper offers an interesting view on the quality assessment of software requirements. Instead of just calculating metrics and creating quality models, the authors use machine learning to mimic the way in which experts judge what is a good requirement and what is not. They use several quality functions to distinguish between good and bad requirements. Using multiple functions, in a multidimensional space, makes it possible to select groups of requirements that are separated from the other class; the figures in the paper show how this works in practice.
The summary of the gist of the paper is actually presented best in the introduction (quote): “Summing up, we can compute a set of quantitative metrics of textual requirements, and through them, we can assess the quality of requirements. However, the risk of this approach is to build assessment methods and tools that are both arbitrary in the parameterization of metrics and rigid in the combination of metrics to evaluate the different properties. This is why we propose in this work to develop a flexible assessment method that can be adapted to different contexts, with a high degree of automation. The method consists basically in the emulation of the experts’ judgment on quality through artificial intelligence techniques: first, obtain the expert’s implicit quality function through machine learning, and, second, apply this function to automatically assess the quality of textual requirements.
Our approach to emulate the experts’ judgment, as explained later in detail, is based on well-known machine learning techniques: we have a computer tool learn from a previous human-made classification of requirements according to their quality. Therefore, our work’s intent is not to improve machine learning techniques, but rather to devise a novel application to the field of requirements quality assessment.”
I strongly recommend reading the paper, as it provides very good methods for working with requirements quality in many modern organisations.
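To make the idea of learning the experts' implicit quality function more concrete, here is a deliberately small sketch. It computes three textual metrics per requirement and labels a new requirement with the label of its nearest expert-classified neighbour. The metrics, the vague-word list, the example requirements and the choice of a 1-nearest-neighbour classifier are all my own assumptions for illustration; they are not the metrics or the learning technique used in the paper.

```python
# Hypothetical list of words that make a requirement vague.
VAGUE_WORDS = {"fast", "easy", "appropriate", "adequate", "user-friendly", "etc", "some"}


def metrics(requirement):
    """Return (word count, number of vague words, number of and/or connectors)."""
    words = [w.strip(".,;:()").lower() for w in requirement.split()]
    return (
        len(words),
        sum(w in VAGUE_WORDS for w in words),
        sum(w in {"and", "or"} for w in words),
    )


def classify(requirement, labelled):
    """1-nearest-neighbour on the metric vectors (squared Euclidean distance)."""
    target = metrics(requirement)

    def distance(example):
        return sum((a - b) ** 2 for a, b in zip(metrics(example[0]), target))

    return min(labelled, key=distance)[1]


# Hypothetical expert classification used as training data.
labelled = [
    ("The system shall respond to a login request within 2 seconds.", "good"),
    ("The export function shall produce a CSV file encoded in UTF-8.", "good"),
    ("The system should be fast and easy to use and maintain, etc.", "bad"),
    ("Some reports must be adequate or appropriate for users.", "bad"),
]

verdict_vague = classify("The interface should be user-friendly and fast.", labelled)
verdict_precise = classify("The system shall log every failed login attempt.", labelled)
print(verdict_vague, verdict_precise)
```

The point of the sketch is the workflow, not the classifier: the experts only provide labels, and the combination of metrics that separates good from bad requirements is learned from those labels instead of being hard-coded.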
Creating recommendation systems is a tricky task. We need to add the temporal domain to the data. In particular, we need to make sure that we capture what was recommended before to the specific user and how the user reacted upon that. We also need to capture the evolution of the users and the data.
In this paper, the authors present a framework, RectoLibry, which helps to construct this kind of system. The framework covers both the development of the recommendations and their deployment.
Our research team has been working with code comments for a while now. We have done analyses of source code comments and the code that is commented. We have also worked with the needs for code comments.
The results showed that AI support for the code review process is very much needed. However, they also showed that the current tools are not good enough yet. One of the tools is DeepCode.ai, which analyzes a code repository and finds problems in the code. It is a great tool, but it has been trained on large data sets from open source projects, which makes it tricky to use on proprietary software. It does not know how to capture project-specific characteristics.
I recently came across this article (https://link-springer-com.ezproxy.ub.gu.se/content/pdf/10.1007/s10664-019-09730-9.pdf), which is about generating code comments. The tool it describes can generate code comments (not review comments) for the Java programming language. It “understands” what the code does and generates a natural language description of the code. The tool is based on the latest work from the NLP domain and has been trained on over 44 million statements, which is quite a number 😊
The tool is not perfect (yet), but it shows an improvement over existing approaches and is definitely opening up new avenues for using NLP in Software Engineering.
Deep learning models are often designed, trained and tested in Python. It is a language with a nice structure, quite straightforward syntax and a lot of libraries. However, very few tutorials about deep learning (or any Python programming tutorials) discuss the quality of the code, e.g. its modularization, encapsulation or naming consistency.
As a result, a lot of machine learning code written in Python is hard to read and hard to grasp. Even when it is part of Jupyter notebooks, the code is often not really commented.
The study behind the link above supports my long-held gut feeling about this. The findings show that (from the abstract): First, long lambda expression, long ternary conditional expression, and complex container comprehension smells are frequently found in deep learning projects. That is, deep learning code involves more complex or longer expressions than the traditional code does. Second, the number of code smells increases across the releases of deep learning applications. Third, we found that there is a co-existence between code smells and software bugs in the studied deep learning code, which confirms our conjecture on the degraded code quality of deep learning applications.
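To get a feeling for the three smells mentioned in the abstract, here is a minimal sketch that looks for them with Python's built-in ast module. The thresholds (40 characters, two comprehension clauses) are my own arbitrary assumptions, not the ones used in the study, and the detection rules are simplified.

```python
import ast

MAX_EXPR_CHARS = 40    # hypothetical threshold for "long" lambdas and ternaries
MAX_COMP_CLAUSES = 2   # hypothetical threshold for for/if clauses in a comprehension

COMPREHENSIONS = (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)


def find_smells(source):
    """Return a sorted list of (line number, smell name) found in the source."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Lambda, ast.IfExp)):
            # Measure the length of the expression as it is written in the source.
            segment = ast.get_source_segment(source, node) or ""
            if len(segment) > MAX_EXPR_CHARS:
                name = "long lambda" if isinstance(node, ast.Lambda) else "long ternary"
                smells.append((node.lineno, name))
        elif isinstance(node, COMPREHENSIONS):
            # Count every "for" clause plus every "if" filter attached to it.
            clauses = sum(1 + len(gen.ifs) for gen in node.generators)
            if clauses > MAX_COMP_CLAUSES:
                smells.append((node.lineno, "complex comprehension"))
    return sorted(smells)


sample = (
    "scale = lambda x: x * 2\n"
    "clip = lambda value, low, high: low if value < low else (high if value > high else value)\n"
    "pairs = [(i, j) for i in range(3) for j in range(3) if i != j if i + j > 1]\n"
)

for lineno, smell in find_smells(sample):
    print(f"line {lineno}: {smell}")
```

The short lambda on the first line is left alone, while the second line is flagged both as a long lambda and as a long ternary expression, and the third line as a complex comprehension.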
I came across this article from Empirical Software Engineering and it caught my attention. It describes a study of how to identify where a bug was introduced.
The article accurately observes that the defects are fixed, most often, in a place where they were NOT introduced. So, the question is whether we can find where the defects were introduced.
Several studies have focused on understanding which release/commit introduced a specific defect. This article describes how to find this particular release. It is based on a theoretical framework of perfect tests, i.e. tests which can capture defects in the releases where they were introduced. The authors of this study evaluate four different algorithms on two different open source projects. Their findings show that it is possible, to some extent, to find the right release where the bug was introduced. Knowing the release and knowing which changes were introduced into the release, it is possible to narrow down the piece of code that contains the bug.
Very interesting work, and I am looking forward to more studies in this area, in particular in the area of proprietary software!
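The idea of a perfect test can be sketched with a simple bisection over the release history. This is my own illustration, not one of the four algorithms evaluated in the article. It assumes that the test passes on every release before the bug was introduced and fails on every release from then on; the release names and the set of buggy releases are hypothetical.

```python
def first_failing_release(releases, test_passes):
    """Return the earliest release for which test_passes(release) is False.

    Assumes the releases are ordered from oldest to newest and that, once the
    bug is introduced, the test fails on all later releases. Returns None if
    the test passes on every release.
    """
    low, high = 0, len(releases)
    while low < high:
        mid = (low + high) // 2
        if test_passes(releases[mid]):
            low = mid + 1   # the bug was introduced after this release
        else:
            high = mid      # the bug was introduced in this release or earlier
    return releases[low] if low < len(releases) else None


# Hypothetical release history; the bug is present from 2.0 onwards.
releases = ["1.0", "1.1", "1.2", "2.0", "2.1", "2.2"]
buggy = {"2.0", "2.1", "2.2"}

culprit = first_failing_release(releases, lambda release: release not in buggy)
print(f"bug introduced in release {culprit}")
```

With a perfect test, the number of releases that need to be built and tested grows only logarithmically with the length of the history. In practice, tests are rarely perfect, which is exactly why the problem is interesting.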