Machine learning and deep learning are only as good as the data used to train them. However, even the best data sources can yield data of suboptimal quality. Noise is one example of such data problems.
Our research team has studied the impact of noise on machine learning in software engineering, mostly on testing data. In this paper we present a technique to identify noise, measure it and reduce it. There are several ways to do this, but we use one of the more robust ones: removal of noise.
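To make the identify, measure and remove steps concrete, here is a minimal sketch in Python. It is not the algorithm from the paper. It assumes one simple definition of noise (identical feature vectors that carry different labels), and the function names and the toy data are made up for illustration.

```python
from collections import defaultdict


def find_contradictory(rows):
    """Return indices of rows whose feature vector occurs with more than one label.

    Assumption: "noise" here means contradictory entries, i.e. the same
    features labelled differently. The paper may define noise differently.
    """
    labels_by_features = defaultdict(set)
    for features, label in rows:
        labels_by_features[tuple(features)].add(label)
    return [
        index
        for index, (features, _) in enumerate(rows)
        if len(labels_by_features[tuple(features)]) > 1
    ]


def noise_ratio(rows):
    """Measure noise as the share of rows that are contradictory."""
    if not rows:
        return 0.0
    return len(find_contradictory(rows)) / len(rows)


def remove_noise(rows):
    """Reduce noise by dropping every contradictory row."""
    noisy = set(find_contradictory(rows))
    return [row for index, row in enumerate(rows) if index not in noisy]


# Hypothetical test-result data: (feature vector, verdict).
rows = [
    ([1, 0, 3], "fail"),
    ([1, 0, 3], "pass"),  # same features as the row above, different label
    ([0, 2, 1], "pass"),
    ([4, 4, 0], "fail"),
]

clean_rows = remove_noise(rows)
```

Dropping both contradictory rows is the most conservative choice in this sketch; keeping the majority label would be an alternative when the data set is small.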
I recommend taking a look at how the algorithms work, and let us know if you find it interesting!
Deep learning in software engineering has been used extensively and there is a significant body of research about this topic. In this post, I would like to share my review of the recent systematic review on the use of DL in SE.
One interesting finding is the list of data sources used for DL. Here, source code is the most prevalent. This is not surprising, as we have GitHub with millions of repositories. The second largest source is repository metadata, for the same reason.
Although it is not surprising, it is really good to see this. I see it as a change in research focus over the last 10 years: it shifted from research on bugs and bug reports to research on source code. I’m happy because helping out with the source code is a real improvement of the product, rather than an improvement of the process.
Another interesting finding is that natural language techniques are the most common ones. Here I cite the paper: “Our analysis found that, while a number of different data pre-processing techniques have been utilized, tokenization and neural embeddings are by far the two most prevalent. We also found that data-preprocessing is tightly coupled to the DL model utilized, and that the SE task and publication venue were often strongly associated with specific types of pre-processing techniques.”
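To give a feeling for what tokenization of source code looks like, here is a minimal sketch in Python. It is my own illustration, not a technique taken from the paper: the regular expression, the function names and the `<unk>` placeholder are all assumptions. A neural embedding layer would then map each integer id to a learned vector, which I leave out here.

```python
import re

# Assumption: a very simple lexer that splits code into identifiers,
# integer literals and single non-whitespace symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a code snippet into a list of token strings."""
    return TOKEN_PATTERN.findall(code)


def build_vocabulary(token_lists):
    """Assign an integer id to every distinct token; id 0 is for unknown tokens."""
    vocabulary = {"<unk>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(tokens, vocabulary):
    """Turn tokens into ids, the usual input format of an embedding layer."""
    return [vocabulary.get(token, vocabulary["<unk>"]) for token in tokens]


training_tokens = tokenize("x = foo(1)")
vocabulary = build_vocabulary([training_tokens])

# "y" was not seen when building the vocabulary, so it maps to the <unk> id.
encoded = encode(tokenize("y = foo(1)"), vocabulary)
```

Real pipelines typically use a language-aware lexer or subword tokenization instead of a single regular expression, but the flow from text to tokens to ids is the same.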
I recommend reading the article to get more insight into the DL models used, of which there are quite many, from the standard CNN to more advanced GANs and autoencoders. Really nice!
Finally, the paper ends with a recommendation on how to use DL in other contexts, presented as a kind of flowchart. I do not want to copy it here, so I recommend taking a look at it in the paper: https://arxiv.org/pdf/2009.06520.pdf
Initially, I did not think that I would put this article on the blog; I thought about using it on my writing advice page. However, after reading it, I realized that it is actually more suitable for this blog.
This article shows how we can classify software engineering research. It has a nice framework, described in Figure 1. The framework is organized around three concepts: who the main beneficiary is, e.g. human, system or researcher (yes, it is a different category!), the type of research contribution, and which research strategies are used.
It’s an article that complements the work of our colleagues from Lund University on the design of design science research studies and the construction of graphical abstracts.
Although the work may seem obvious to a seasoned researcher, I sometimes need to be reminded about what kind of study I want or need to conduct. Therefore I recommend this as reading for PhD candidates, master's students and also advanced researchers. Using the classification scheme will definitely help us to understand each other better and to reduce the burden of paper reviews!
I’ve been looking out for good examples of articles about action research in software engineering for a while. There are a lot of those coming from the participatory design community and from ethnography in software engineering.
This paper is an example of how one can conduct action research together with an open source community. It shows how to conduct the research while being part of the community and adds a new angle on the topic: how to democratize the research design. In contrast to company-based development, an open source community is free to accept the new ways of working or not. Therefore, it can be challenging to make the action happen.
Figure 1 from the paper shows the process in more detail, and I strongly recommend taking a look at it. It starts from the design of the intervention, which takes community requirements, similar communities, best practices and problems as inputs. Looking at similar communities as precedents is new and important, as it helps to leverage good practices that have already been adopted.
The methodology has already been evaluated, and the evaluation shows that it is a valid and interesting new research method!
Abstract: Participatory Action Research (PAR) is an established method to implement change in organizations. However, it cannot be applied in the open source (FOSS) communities, without adaptation to their particularities, especially to the specific control mechanisms developed in FOSS. FOSS communities are self-managed, and rely on consensus to reach decisions. This study proposes a PAR framework specifically tailored to FOSS communities. We successfully applied the framework to implement a set of quality assurance interventions in the Robot Operating System community. The framework we proposed is composed of three components, interventions design, democratization, and execution. We believe that this process will work for other FOSS communities too. We have learned that changing a particular aspect of a FOSS community is arduous. To achieve success the change must rally the community around it for support and attract motivated volunteers to implement the interventions.