Reproducing AI models – a guideline

Image by Pete Linforth from Pixabay

2107.00821.pdf (arxiv.org)

Machine learning has become a great tool in software engineering, for both research and development. The fact that we have access to TensorFlow, PyTorch, and other toolkits provides almost endless possibilities. Combine that with the hundreds (if not thousands) of datasets from Zenodo and Co., and you can train a model for almost anything.

So far, so good, I would say. Problems (yes, there are always some problems) appear when we want to reproduce the results of others. Training a model on your own dataset and making it available is easy. Trusting such a model in a new context is not.

Imagine an ML model trained on data from Company X. We have probably tuned the parameters a lot, so the model works great there, but does it work for Company Y? Most probably it will not. Well, it will run, but the performance of its predictions is not going to be great.

So, Google has partnered with academic institutions to set up SIGMODELS and the TensorFlow Model Garden, initiatives aimed at making ML models more portable, experiments more replicable, and all the other goodies.

In this paper, the authors provide a set of checks which we can use to make models more transparent, which is the first step towards reproducibility. In these guidelines, the authors advocate reporting the model's architecture, its input and output structure, its building blocks, loss functions, etc.
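To make this concrete, here is a minimal sketch of what such reporting could look like in practice – assuming a Keras model, with a toy architecture that only stands in for whatever was actually trained:

    import tensorflow as tf

    # Toy model that only stands in for whatever we actually trained.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32,)),              # input structure: 32 features
        tf.keras.layers.Dense(16, activation="relu"),     # building block
        tf.keras.layers.Dense(1, activation="sigmoid"),   # output structure: one probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # Architecture, building blocks, and parameter counts, layer by layer.
    model.summary()
    print("Loss function:", model.loss)

Even printing this much – the layers, the input/output shapes, and the loss – already tells the reader much more than many papers do.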

Naturally, they also recommend reporting the metrics that were used to optimize the models, e.g. accuracy, F1-score, MCC, or others. I know, these are probably essentials, but you would be surprised how many authors do not really report them. If they are omitted, how do we know whether the metrics were simply so poor that the authors left them out (low performance of the model) or whether they were just not relevant (low relevance of the metrics – which is a good thing)?
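And reporting the metrics really is only a couple of lines of code. A minimal sketch, assuming scikit-learn and made-up test labels and predictions:

    from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

    # Made-up test labels and model predictions, just to show the calls.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print("Accuracy:", accuracy_score(y_true, y_pred))    # 0.75
    print("F1-score:", f1_score(y_true, y_pred))          # 0.75
    print("MCC:", matthews_corrcoef(y_true, y_pred))      # 0.50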

For now, these guidelines are only a draft, but I hope that they will become more mainstream, just like the empirical guidelines from ACM (GitHub – acmsigsoft/EmpiricalStandards: Empirical standards for conducting and evaluating research in software engineering).

What will the future bring ….

https://www.amazon.se/Brief-Answers-Big-Questions-Stephen/dp/1473695996/ref=sr_1_1?dchild=1&keywords=Hawking&qid=1625726909&sr=8-1

As my summer goes on, I’ve decided to take a look at a book by one of my favorite authors and scientists – Prof. Stephen Hawking. I loved his books when I was younger, and I like the way he could bring difficult theories to the masses, like his famous “gray holes” as opposed to the black ones.

This book allowed me to reflect on some of the most common questions that people ask, even me. Like whether AI will take over, or whether we should even invest in AI. Since I’m not a physicist, I cannot answer most of the questions, but the AI question is one I can at least attempt.

So, will AI take over? Is GPT-3 something to worry about? Will we be out of work as programmers? Well, not really. I think that we live in a world that is very diverse and that we need human judgement to make sure we can live on. Take the recent cyber attack on Kaseya, a US-based company with thousands of clients. The attack affected a minority of their clients, some 40 or so (if I remember the article correctly). However, it left the entire Coop grocery chain in Sweden stranded. Food was given away for free, as there was no way to take payments. Other grocery shops bought stock from the affected chain to make sure people had enough food. So, what would AI do?

Let’s think statistically for a moment. 40 customers out of ca. 40,000 is about 0.1 percent. So, ignoring this event would still leave the AI with 99.9% accuracy. Is this good? Statistically, this is great! Almost perfect. For an AI, therefore, ignoring the event would look like great optimization: do not pay the ransom, and treat that 0.1% as negligible error somewhere.

Now, let’s think about the social value. Without knowing the rest of the affected customers, or even the ones that were not affected, I would say that the value of having food on your table trumps many other kinds of value. Well, maybe not your health, but definitely something like a car or a computer game. So, the societal impact is large. We could model that in the AI, but there is a programmatic problem: how do we calculate the value of such diverse things as a car or food? There is the monetary value, of course, but it is not constant in time. For someone who has been hungry for days, the value of a sandwich is infinitely larger than the value of the same sandwich for someone who has just eaten a delicious steak. Another problem is that the value depends on the location (is there another grocery shop close by?), your stock at home (which is individual and hard for an AI to find out), or even the ability to use another system of payment (can I just get my groceries and pay later?).

This example shows an inherent problem in finding the right data to use for AI. I believe this is a problem that will not really be solved. And if it cannot be solved, I think I would like to pay a few cents extra to have a human in the loop. I would like to know that there is an option, in the event of a hacker attack, to talk to a person who understands my needs and can help me. Give me food without paying, knowing where I live and that I will pay later.

Until there are models which understand us humans, we need to stick to having humans in the loop. Given that there are ca. 8 billion people in the world, all potentially different and with conflicting needs, I do not think AI will be able to help us in critical parts of society.

News from ML?

I seldom write about films and events – well, maybe actually never – but this year, a lot has happened online.

What’s new in Machine Learning | Keynote – YouTube

The video above is about the news from Google about their TensorFlow library, including new ways of training models, compression, performance tuning, and more.

TensorFlow Lite and TensorFlow.js allow us to use the same models as on desktops, but on mobile devices and in the browser. Really impressive. I’ve caught myself wondering whether I’m more impressed by the hardware capabilities of small devices or by the capabilities of the software. Either way – super cool.
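For the curious, here is roughly what the desktop-to-mobile step looks like with the TensorFlow Lite converter – a minimal sketch, where the path to the saved model is just a placeholder:

    import tensorflow as tf

    # Placeholder path to a model trained and exported on the desktop.
    saved_model_dir = "path/to/saved_model"

    # Convert the model to the TensorFlow Lite format used on mobile devices.
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional size/latency optimization
    tflite_model = converter.convert()

    with open("model.tflite", "wb") as f:
        f.write(tflite_model)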

Google is not the only company announcing something. NVIDIA is also showing a lot of cool features for enterprises. Cloud access for rapid prototyping, model testing, and deployment is at the center of that.

NVIDIA Executive Keynote for Enterprise AI at COMPUTEX 2021 – YouTube

I like gaming, so this is impressive, but even more impressive is last year’s DLSS technology, which still cannot be beaten by the competition. Really nice.

Challenges when using ML for SE (article review)

Image by Pexels from Pixabay

104294.pdf (scitepress.org)

Machine learning has been used in software engineering for a while now. It used to be called advanced statistics, but with the popularization of artificial intelligence, we use the term machine learning more often. I’m one of those who like to use ML. It’s actually a mesmerizing experience when you train neural networks – change one parameter, wait a bit, see how the network performs, then do it again. Trust me, I’ve done it all too often.

I like this paper because it focuses on challenges for using ML, from the abstract:

In the past few years, software engineering has increasingly automating several tasks, and machine learning tools and techniques are among the main used strategies to assist in this process. However, there are still challenges to be overcome so that software engineering projects can increasingly benefit from machine learning. In this paper, we seek to understand the main challenges faced by people who use machine learning to assist in their software engineering tasks. To identify these challenges, we conducted a Systematic Review in eight online search engines to identify papers that present the challenges they faced when using machine learning techniques and tools to execute software engineering tasks. Therefore, this research focuses on the classification and discussion of eight groups of challenges: data labeling, data inconsistency, data costs, data complexity, lack of data, non-transferable results, parameterization of the models, and quality of the models. Our results can be used by people who intend to start using machine learning in their software engineering projects to be aware of the main issues they can face.

So, what are these challenges? Well, I’m not going to go into details about all of them, but I’d like to focus on the one that is closest to my heart – data labelling. The process of labelling, or tagging, data is usually very time-consuming and error-prone. You need to be able to remember how you labelled the previous data points (consistency), but also understand how to think when finding new cases. The paper does not spell out these challenges in detail, but gives pointers to a few papers where they are defined.

Siri, Write the Next Method… (article highlight)

Image by yangjiepsy01 from Pixabay

Wen2021a.pdf (usi.ch)

I came across this article by accident. Essentially, I do not even remember what I was looking for, but that’s maybe not so important. Either way, I really want to try this tool.

This research study is about designing a tool for code completion – not just completion of a word, statement, or variable, but one that provides the signature of the next method to implement.

From the abstract: “Code completion is one of the killer features of Integrated Development Environments (IDEs), and researchers have proposed different methods to improve its accuracy. While these techniques are valuable to speed up code writing, they are limited to recommendations related to the next few tokens a developer is likely to type given the current context. In the best case, they can recommend a few APIs that a developer is likely to use next. We present FeaRS, a novel retrieval-based approach that, given the current code a developer is writing in the IDE, can recommend the next complete method (i.e., signature and method body) that the developer is likely to implement. To do this, FeaRS exploits “implementation patterns” (i.e., groups of methods usually implemented within the same task) learned by mining thousands of open source projects. We instantiated our approach to the specific context of Android apps. A large-scale empirical evaluation we performed across more than 20k apps shows encouraging preliminary results, but also highlights future challenges to overcome.”

As far as I understand, this is a plug-in for Android Studio, so I will probably need to see if I can use it outside of this context. However, it seems very interesting…

Data labelling – the activity that makes people hate ML…

Image by S. Hermann & F. Richter from Pixabay

Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies | SpringerLink

Machine learning is hungry for data. The more you have, the happier it will be. It seems very easy when we learn how to program ML and how it works – there are plenty of open data sources to practice and learn from.

However, when we want to use ML for our own purposes, things get a bit more complicated. There is a lot of data, but not in the right format. The data that is in the right format is incomplete. The data that is complete is noisy. The data that is not noisy is too scarce, so we need to collect more. And so the story goes on, and on, and on…

Collecting the data is not that problematic, as it can often be automated – at least in software engineering, automotive, telecom, transport/logistics, and medicine. These are the domains I know, anyway. What is problematic, though, is data labelling. This is the activity where we take each data point and add a class to it – or a label, if we speak machine-learnish. The person doing the labelling needs to be competent enough to label the data correctly – he/she needs to know the domain, the data, and the context. Then, this person also needs to have a fantastic memory, because the labels need to be consistent. They also need to be unambiguous given the underlying feature vector.
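One way to keep an eye on that consistency is to have two people label the same sample and measure how much they agree. A small sketch with made-up labels, using scikit-learn’s Cohen’s kappa:

    from sklearn.metrics import cohen_kappa_score

    # Made-up labels from two annotators for the same ten data points.
    annotator_a = ["bug", "bug", "ok", "ok", "bug", "ok", "ok", "bug", "ok", "ok"]
    annotator_b = ["bug", "ok",  "ok", "ok", "bug", "ok", "bug", "bug", "ok", "ok"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")

A kappa close to 1 means the annotators label consistently; a low kappa is an early warning that the labelling instructions are ambiguous.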

In this paper, colleagues from our department study the process of data labelling and its challenges.

They find the following to be selected examples of challenges:

  • Lack of a systematic approach to labeling data for specific features
  • Unclear responsibility for labeling
  • Noisy labels
  • Difficulty to find a correlation between labels and features
  • Skewed label distributions
  • Time dependence
  • Difficulty to predict future uses for datasets

I think it’s great work and a great read for everyone who wants to get into ML for real, start using it at a company, and understand whether it actually gives any benefit.

From the abstract: Labeling is a cornerstone of supervised machine learning. However, in industrial applications, data is often not labeled, which complicates using this data for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning, and semi-supervised learning, these still do not provide accurate and reliable labels for every machine learning use case in the industry. In this context, the industry still relies heavily on manually annotating and labeling their data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using a semi-structured interview with data scientists at two companies to explore their problems when labeling and annotating their data. This paper provides two contributions. We identify industry challenges in the labeling process, and then we propose mitigation strategies for these challenges.

What’s going on in Deep Learning – a literature review (paper review)

https://arxiv.org/pdf/2009.06520.pdf

Image by Free-Photos from Pixabay

Deep learning has been used extensively in software engineering, and there is a significant body of research on the topic. In this post, I would like to share my review of a recent systematic review on the use of DL in SE.

One interesting finding is the list of data sources used for DL. Here, source code is the prevalent data source. This is not surprising, as we have GitHub with millions of repositories. The second largest is repository metadata, again for the same reason.

Although it is not surprising, it is really good to see this. I see it as a change in research focus over the last 10 years: it has shifted from research on bugs and bug reports to research on source code. I’m happy about that, because helping out with the source code is a real improvement of the product, not just an improvement of the process.

Another interesting finding is that natural language techniques are the most common ones; here I cite the paper: “Our analysis found that, while a number of different data pre-processing techniques have been utilized, tokenization and neural embeddings are by far the two most prevalent. We also found that data-preprocessing is tightly coupled to the DL model utilized, and that the SE task and publication venue were often strongly associated with specific types of pre-processing techniques.”
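To illustrate what tokenization plus neural embeddings means for source code, here is a minimal TensorFlow sketch – the snippets are made up and the vocabulary and layer sizes are arbitrary:

    import tensorflow as tf

    # A few made-up code snippets standing in for a corpus mined from GitHub.
    code_snippets = [
        "def add ( a , b ) : return a + b",
        "def sub ( a , b ) : return a - b",
        "for i in range ( 10 ) : print ( i )",
    ]

    # Tokenization: turn each snippet into a fixed-length sequence of token ids.
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=1000, standardize=None, output_sequence_length=16)
    vectorizer.adapt(code_snippets)
    token_ids = vectorizer(tf.constant(code_snippets))

    # Neural embeddings: map every token id to a dense 32-dimensional vector.
    embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=32)
    embedded = embedding(token_ids)
    print(embedded.shape)  # (3, 16, 32)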

I recommend reading the article to get more insight into the DL models used, which are quite many – from the standard CNN to more advanced GANs and autoencoders. Really nice!

Finally, the paper ends with a recommendation on how to use DL in other contexts, in the form of a flowchart. I do not want to copy it here, so I recommend taking a look at it in the paper: https://arxiv.org/pdf/2009.06520.pdf

Finding lines of code that require review – my 100th blog post!

Image by skeeze from Pixabay

Working with continuous integration is an exciting new field. You get your code into the main branch directly. Well, that’s what the theory says. What you really get directly is feedback – at least the feedback from the automated checks for technical debt, testing, and similar.

What you do not get quickly is the review of your code by your colleagues. In larger organizations, things like code reviews do not get prioritized, so they tend to slow down software development rather than speed it up!

In this paper, we set out to understand how to fix that. We used Gerrit as the tool to extract the lines of code that need to be reviewed, instead of reviewing all of the lines. Here is a short video about this: https://play.gu.se/media/t/0_h7hx95d2

The abstract of the paper is included:

Code reviews are one of the first quality assurance tasks in continuous software integration and delivery. The goal of our work is to reduce the need for manual reviews by automatically identify which code fragments should be further reviewed manually. We conducted an action research study with two companies where we extracted code reviews and build machine learning classifiers (AdaBoost and Convolutional Neural Network — CNN). Our results show that the accuracy of recognizing code fragments that require manual review, measured with Matthews Correlation Coefficient, was 0.70 in the combination of our own feature extraction and CNN. We conclude that this way of combining automation with manual code reviews can improve the speed of reviews while providing organizations with the possibility to support knowledge transfer among the designers.
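This is not the exact setup from the paper (there we use our own feature extraction and a CNN alongside AdaBoost), but here is a toy scikit-learn sketch of the idea: bag-of-token features per line of code, an AdaBoost classifier, and MCC as the evaluation metric. The lines and labels are made up.

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import matthews_corrcoef
    from sklearn.model_selection import train_test_split

    # Made-up lines of code, labeled 1 if a reviewer commented on them in Gerrit.
    lines = [
        "int x = compute(a, b);",
        "memcpy(dst, src, len + 1);",
        "if (ptr == NULL) return;",
        "free(buffer); free(buffer);",
        "return result;",
        "strcpy(name, input);",
    ]
    needs_review = [0, 1, 0, 1, 0, 1]

    # Simple bag-of-tokens features per line (the paper uses its own feature extraction).
    features = CountVectorizer(token_pattern=r"\S+").fit_transform(lines)
    X_train, X_test, y_train, y_test = train_test_split(
        features, needs_review, test_size=0.5, stratify=needs_review, random_state=1)

    clf = AdaBoostClassifier(random_state=1).fit(X_train, y_train)
    print("MCC:", matthews_corrcoef(y_test, clf.predict(X_test)))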

Recommending refactoring via commit message analysis

Image by annca from Pixabay

https://doi.org/10.1016/j.infsof.2020.106332

In the process of reviewing code, we can identify refactoring opportunities pretty easily. We read the code, try to understand it, and provide comments. In the understanding phase we also get ideas about possible alternatives – why is this done this way?

Now, when writing the comments, we rarely have time to refactor the code. In CI, this reviewing process happens when we commit the code to the main branch, and therefore we expect the code to be delivered and used soon. So, it’s too late to refactor; we need to do it in the next iteration.

But the next iteration looks the same: we need to deliver new functions, not gold-plate the existing code, deliver it to the main branch, and so on. When is the time for refactoring, then? How do we document the opportunities and use them when we have a bit of time?

In this work, the authors look at commit messages and identify refactoring opportunities from them, complementing the static and dynamic analysis of code. The method presented in the paper is based on the analysis of code from open source projects, the refactorings applied to that code, and the analysis of the QMOOD quality attributes related to these commits.

The following quote from the paper explains the gist of how the extraction of the refactoring rationale works:

Identifying refactoring rationale has two parts. The first part is the detection of the files that are refactored by developers in a commit. The second part is the identification of changes in the QMOOD quality attributes then comparing these changes with the information in the commit message.

For the first part, we used the GitHub API to identify the changed files in each commit. In the second part, we compared the QMOOD quality attribute values before and after the commit to capture the actual quality changes for each file. Once the changed files and quality attributes were identified, we checked if the developers intended to actually improve these files and quality attributes. In fact, we preprocessed the commit messages and we used the names of code elements in the changed files and the changed quality metrics as keywords to match with words in the commit message. Once the refactoring rationale is automatically detected using this procedure, we continue with the next step to find better refactoring recommendations that can fully meet the developer’s intentions and expectations. In case that no quality changes were identified at all then a warning will be generated to developers that the manually applied refactorings are not addressing the quality issues described in his commit message.
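As a rough illustration of the first part – fetching the changed files of a commit and matching them against the commit message – here is a small sketch against the GitHub REST API. The repository, commit SHA, and keyword list are all made up, and the real method compares QMOOD metric values rather than a fixed word list.

    import requests

    # Made-up repository and commit SHA; the GitHub REST API returns the commit
    # message and the list of changed files for a given commit.
    owner, repo, sha = "some-org", "some-project", "abc1234"
    commit = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}").json()

    message = commit["commit"]["message"].lower()
    changed_files = [f["filename"] for f in commit.get("files", [])]

    # Toy rationale check: does the commit message mention the changed code
    # elements (here just the file stems) or any quality-related keyword?
    quality_keywords = {"refactor", "readability", "complexity", "coupling", "cohesion"}
    mentioned_files = [f for f in changed_files
                       if f.rsplit("/", 1)[-1].split(".")[0].lower() in message]
    mentioned_quality = quality_keywords & set(message.split())

    if not mentioned_quality:
        print("Warning: the commit message does not mention any quality attribute.")
    print("Changed files referenced in the message:", mentioned_files)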

Who should fix this bug – again?

Image by Iván Tamás from Pixabay

Link to article: https://doi.org/10.1002/spe.2830

Assigning defects is a task that is not much fun. Companies need to do it, but the people who do it change often, as the task is quite labor-intensive and tiresome. There is, of course, a significant body of research about this, and here is one example of it.

What is interesting in this article is that the authors use temporal data about the defect reports to assign teams. From the abstract: “In this article, we describe a new BA approach that relies on two key intuitions. Similar to traditional BA methods, our method constructs the expertise profile of project developers, based on the textual elements of the bugs they have fixed in the past; unlike traditional methods, however, our method considers only the programming keywords in these bug descriptions, relying on Stack Overflow as the vocabulary for these keywords. The second key intuition of our method is that recent expertise is more relevant than past expertise, which is why our method weighs the relevance of a developer’s expertise based on how recently they have fixed a bug with keywords similar to the bug at hand.”

The method uses text similarity measures to match defects and performs better than existing methods based on meta-parameters. What this means in practice is that the only thing needed to make the predictions is the defect description – or, rather, a failure report.
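To show the general flavour of such an approach (this is my own toy reconstruction, not the paper’s algorithm), here is a sketch that combines TF-IDF text similarity with a simple recency weight to rank developers for a new bug report; the history and dates are made up:

    from datetime import datetime
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Made-up history: past defect descriptions, who fixed them, and when.
    history = [
        ("nullpointerexception in json parser when a field is missing", "alice", datetime(2021, 6, 1)),
        ("race condition in thread pool shutdown", "bob", datetime(2020, 1, 15)),
        ("jsonarray index out of bounds while parsing the response", "alice", datetime(2021, 5, 20)),
    ]
    new_bug = "crash: nullpointerexception while parsing a json response"

    # Text similarity between the new report and every previously fixed defect.
    texts = [text for text, _, _ in history] + [new_bug]
    tfidf = TfidfVectorizer().fit_transform(texts)
    similarities = cosine_similarity(tfidf[len(history):], tfidf[:len(history)])[0]

    # Weigh each past fix by how recent it is: newer expertise counts more.
    now = datetime(2021, 7, 1)
    scores = {}
    for (text, developer, fixed_on), sim in zip(history, similarities):
        recency = 1.0 / (1.0 + (now - fixed_on).days / 365.0)  # simple decay, not the paper's formula
        scores[developer] = scores.get(developer, 0.0) + sim * recency

    print("Suggested assignee:", max(scores, key=scores.get))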

Very interesting work to apply; it seems that the entry barrier is not that high for new companies.