article review – Page 5 – SE metrics (Software Engineering)

Siri, Write the Next Method… (article highlight)

Wen2021a.pdf (usi.ch)

I’ve came across this article by accident. Essentially I do not even remember what I was looking for, but that’s maybe not so important. Either way, I really want to try this tool.

This research study is about designing a tool for code completion, but not just a completion of a word/statement/variable, but providing a signature of the next method to implement.

From the abstract: “Code completion is one of the killer features of Integrated Development Environments (IDEs), and researchers have proposed different methods to improve its accuracy. While these techniques are valuable to speed up code writing, they are limited to recommendations related to the next few tokens a developer is likely to type given the current context. In the best case, they can recommend a few APIs that a developer is likely to use next. We present FeaRS, a novel retrieval-based approach that, given the current code a developer is writing in the IDE, can recommend the next complete method (i.e., signature and method body) that the developer is likely to implement. To do this, FeaRS exploits “implementation patterns” (i.e., groups of methods usually implemented within the same task) learned by mining thousands of open source projects. We instantiated our approach to the specific context of Android apps. A large-scale empirical evaluation we performed across more than 20k apps shows encouraging preliminary results, but also highlights future challenges to overcome.”

As far as I understand, this is a plug-in to android studio, so I will probably need to see if I can use it outside of this context. However, it seems to be very interesting….

Exploring code weaknesses in StackOverflow

https://doi.ieeecomputersociety.org/10.1109/TSE.2021.3058985

Whether we like it or not, software designers, programmers and architects use StackOverflow. Mostly because they want to be part of a community – help others and help themselves.

However, StackOverflow has become a de-facto go-to place to find programming answers. Oftentimes, these answers include usage of libraries or other solutions. These libraries solve the immediate problems, but they can also introduce vulnerabilities that the programmers are not aware of.

In this article, the authors review how C/C++ authors introduce and revise vulnerabilities in their code. From the introduction: “We scan 646,716 C/C++ code snippets from Stack Overflow answers. We observe that code weaknesses are detected in 2% of the C/C++ answers with code snippets; more specifically, there are 12,998 detected code weaknesses that fall into 36% (i.e., 32 out of 89) of all the existing C/C++ CWE types. “

I like that the paper presents a number of good examples, which can be used for training of software engineers. Both at the university level and later during their work. Some of them can even be used to create coding guidelines for companies – including good and bad examples.

The paper has a lot of great findings about the way in which weaknesses and vulnerabilities are introduced, for example ” 92.6% (i.e., 10,884) of the 11,748 Codew has weaknesses introduced when their code snippets were initially created on Stack Overflow, and 69% (i.e., 8,103 out of 11,748) of the Codew has never been revised “

I strongly recommend to read the paper and give it to your software engineers to scan….

Crowdsmelling…

2012.12590.pdf (arxiv.org)

The concept of crowdsourcing is well known in our community. We are accustomed to reading other’s code and learning from it at the same time improving it. Even the “captcha’s” are a good example of crowdsourcing.

However, crowdsmelling? Well, the idea is not as outrageous as one might think. It’s actually an interesting one. It is essentially a way of using collective knowledge about code smells to design machine learning to recognize them. It’s actually the very idea which we use in our Software Center project, and which we support.

In this paper, the authors focus on special kind of code smells – the ones linked to technical debt. The results are promising and we should keep an eye on this work in order to see if this improves.

From the abstract: “Good performances were obtained for God Class detection (ROC=0.896 for Naive Bayes) and Long Method detection (ROC=0.870 for AdaBoostM1), but much lower for Feature Envy (ROC=0.570 for Random Forrest).”

War and algorithm

Amazon.com: War and Algorithm (9781786613646): Liljefors, Max, Noll, Gregor, Steuer, Daniel: Books

Understanding legal aspects of modern autonomous systems requires a philosophical and practical discourse. On the one hand, we need to understand what legal responsbility means in the context of autonomous systems. We need to understand who is responsible for the actions of the system, what the actions are and whether the system actually reacted as designed vs. whether new behaviour occurred.

On the other hand, we also need to understand that the introduction of the autonomous systems changes the legal systems. Autonomous systems do not require operators and therefore they are capable of interacting with each other. The notion of conflicts, damage and collateral damage get completely new dimensions.

I’ve picked up this book because I’ve had the possibility to work with one of the authors and met one more at a dinner a while back. They got me interested in the legal aspects of autonomous systems. In their book, the authors discuss various aspects of such systems. They start from the foundation of the legality of conflicts and then they move over to modern warfare. They provide historical examples of how the legal systems were (and are) shaped by the so-called LAWS (Lethal Autonomous Warfare Systems).

I particularly like the aspects related to the design of the systems and the fact that in chapter 4, the authors discuss the process of learning from the machine, or the algorithm. They call it the process of debugging, which is a new way of looking at the concept of understanding algorithms.

What I miss in the book, however, is the discussion on the quality of the AI systems. Although it is not explicit, it seems to me that the authors assume that an AI system is perfect, makes no mistakes dues to design defects (bugs). If this assumption is true, then it the discussion about the responsibility is a bit simpler, because we do not recognize the problems where an individual (a programmer) gives his/her best, but the testers or others in his team make mistakes. So, the responsibility is not on an individual (programmer, tester, architect), but on the entire company.

Either way, I’m happy that I had the possibility to listen to some of the authors and to work with them.

Law for Computer Scientists and other folks (review)

Law for Computer Scientists (pubpub.org)

Recently, a colleague of mine has recommended me this book. At first, I thought it would be a bit like “Law for dummies”, but it turned out to be much better than I actually thought.

The book is about how we, as software engineers, should look at the legal systems. It poses more questions than it actually answers, but it provides a number of great examples.

I sincerely recommend this book. The following parts have captured my attention:

Existence of different types of law and jurisdictions: national, international and supernational. Data and computer programs are perfect examples of different jurisdictions and the fact that different types of laws apply.
What constitutes data, meta-data and sensitive data. In Chapter 5, the authors mention that we cannot process sensitive data very easily, e.g. data about religion, gender, etc. Then, how can we make the systems fair and unbiased if we cannot process this kind of data?
Cybercrimes and how to deal with them. The author provides great examples of legislation that is supposed to help to fight cybercrime.

However, the best is always left for last and this book is no exception. The author provides a great discussion on the future of our legal systems. She does that by discussing the concept of personhood for AI or any other complex system. Although it sounds like a distant future, it is closer than we think. EU has already started to work on this kind of legislation.

Finally, I love the fact that the author brings in the three laws of robotics by Asimov – a real connection to computer science and software engineering.

Understanding what’s going on helps you become a better software developer…

10.1109/MS.2020.3014223

I’m a big fan of the Matrix movies, but well, to be honest, who isn’t:) I like the scene where Morpheus gives Neo the choice of two pills – one to know the truth and the other one to go on living his life as previously.

Well, sometimes I feel the same when I do my programming tasks – do I really want to know what the code does, or just make a quick fix and move on? I would say that it’s 50-50 for me – sometimes I feel like contributing and sometimes I just fix the problem and move on.

In this paper, the authors conduct an experiment to understand how and when software developers make mistakes. They find that “[the] study suggests that a relatively high number of mistakes are related to communicating with stakeholders outside of the development team. “

Having worked with metrics teams all over the globe, I’ve noticed that the communication with the stakeholders is often the largest problem that you can have. The stakeholders don’t speak “requirements” and we do not understand “wants” of the stakeholders. But, well, it’s not what the paper is about.

What I like about the paper is the systematic approach to the study – using experiments and a technique for teaching the developers how to work with their limitations. This is what the authors recommend as remedies (quoted directly from the paper):

Know your own weaknesses. Every developer is different and struggles with different concepts. Our analysis shows a variety of types of errors that developers make. Developers becoming more conscious of the human errors they commonly make and actively checking for these can help reduce errors.
Use cognitive training. We have shown that using cognitive training, like the OODA loop, seems to help decision making and can reduce the human errors a developer makes.
Simplify your workload. One of the biggest causes of human error reported by the developers in our study was the complexity of the development environment. Reducing the cognitive load by simplifying the complexity of the development environment could reduce human errors. Actions such as minimizing the number of simultaneous development tasks and closing down unnecessary tools and windows can help reduce the cognitive load.
Communicate carefully with stakeholders outside your team. Our study suggests that a relatively high number of mistakes are related to communicating with stakeholders outside of the development team. Ensuring that communication is clearly understood seems important to reducing mistakes.

Consistency in code reviews (article review)

tse2020_hirao.pdf (uwaterloo.ca)

In the last year, I’ve written a lot about code reviews, mostly because this is where I put my effort now and where I see that software engineers could improve.

Although there is a lot of studies about how good code reviews are and what kind of benefits they bring, there is no doubt that code reviews are a tiresome task. You read software code and try to improve it, but, let’s be honest, if it works don’t break it – right?

In this paper, the authors study open source communities and check how often the reviewers actually agree upon the code review score. They find that it’s not that often – 37% disagree. From the paper: “How often do patches receive divergent scores? Results: Divergent review scores are not rare. Indeed, 15%–37% of the studied patch revisions that receive review scores of opposing polarity“

They also study how the divergence actually influences the patches – are they integrated or not: “Patches are integrated more often than they are abandoned. For example, patches that elicit positive and negative scores of equal strength are eventually integrated on average 71% of the time. The order in which review scores appear correlates with the integration rate, which tends to increase if negative scores precede positive ones. “

Finally, they study when the discussions/disagreements happen and how many reviewers there actually are: “Patches that are eventually integrated involve one or two more reviewers than patches without divergent scores on average. Moreover, positive scores appear before negative scores in 70% of patches with divergent scores. Reviewers may feel pressured to critique such patches before integration (e.g., due to lazy consensus).2 Finally, divergence tends to arise early, with 75% of them occurring by the third (QT) or fourth (OPENSTACK) revision. “

I think that these results say something about our community – that we tend to disagree, but do integrate the code anyways. What does that mean?

It could mean two things, which IMHO are equally valid:

The review comments do not really touch upon crucial aspects and therefore are deemed not so important (e.g. whether we call something weatherType or typeOfWeather as a variable…)
The reviewers’ reputation makes it difficult to get some of the comments through, e.g. when a junior reviewer is calling for a complete overhaul of the architecture.

Either way – I think that the modern code review field is quite active these days and I hope that we can get something done about the speed and quality of these long and tiresome code review processes.

Data labelling – activity that makes people hate ML….

Image by S. Hermann & F. Richter from Pixabay

Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies | SpringerLink

Machine learning is hungry for data. The more you have, the happier it will be. Seems very easy when we learn how to program ML and how it works – there is plenty of open data sources to practice and learn from.

However, when we want to use ML for our purposes, things get a bit more complicated. There is a lot of data, but not in the right format. The one that is in the right format is incomplete. The one that is complete, is noisy. The one that is not noisy is too little. We need to collect more. And so the story goes on, and on, and on….

Collecting the data is not that problematic, as it can often be automated. At least in software engineering, automotive, telecon, transport/logistic and medicine. These are the ones I know, anyways. What is problematic, though is data labelling. It is the activity where we take each data point and add a class to it, or its label if we speak machine-learnish. The person doing the labelling needs to be competent to be able to label the data correctly – he/she needs to know the domain, know the data, know the context. Then, this person also needs to have a fantastic memory, because the labels need to be consistent. They also need to be unambiguous given the underlying feature vector.

In this paper, colleagues from our department study the process of data labelling and its challenges.

They find the following to be selected examples of challenges:

Lack of a systematic approach to labeling data for specific features
Unclear responsibility for labeling
Noisy labels
Difficulty to find a correlation between labels and features
Skewed label distributions
Time dependence
Difficulty to predict future uses for datasets

I think it’s a great work and reading for everyone who wants to get into ML for real, start using it at a company and understand whether it’s actually gives any benefit.

From the abstract: Labeling is a cornerstone of supervised machine learning. However, in industrial applications, data is often not labeled, which complicates using this data for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning, and semi-supervised learning, these still do not provide accurate and reliable labels for every machine learning use case in the industry. In this context, the industry still relies heavily on manually annotating and labeling their data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using a semi-structured interview with data scientists at two companies to explore their problems when labeling and annotating their data. This paper provides two contributions. We identify industry challenges in the labeling process, and then we propose mitigation strategies for these challenges.

Testing machine learning systems…

https://rdcu.be/caKuc

Today, everybody is talking about machine learning and AI. Some talk about deterministic models, some about statistical ones, some about bayesian, some talk about X-mas 🙂

My experience with working with machine learning is that we need to be very careful what we actually do. If we do the machine learning in the classical sense, e.g. neural network models or decision trees. Then we need to make sure that we test the system alongside the data. Never together with the data. We need to prepare a dataset that we use as a reference and which we know well.

Testing, in that scenario, becomes just like we know it. We can make calculations manually, or just step-by-step, and we can check if the algorithm behaves like this.

Testing the system is also not difficult if we follow principles of good engineering – separation of concerns, modularization, observability.

In the runtime, we need to make sure that we add mechanisms related to such aspects as out-of-bounds distributions and safety cages for ML algorithms.

Either way, I recommend this article for all ML designers and product managers who want to know what’s the state of the art in this field, from the perspective of testing. A good overview, nice reading!

Who and when needs automated code reviews…

https://rdcu.be/caKsW

Having worked with code reviews for a while, I strongly sympathize with the thesis put forward by the authors of this paper – code review tools are still far from being supporting for software developers.

Yes, they do automate the process and organize it. Yes, they help in assuring that all code is reviewed and yes, they do help to capture problems in the code and help to spread the knowledge.

However, what I expect from such a tool is to help me to find problems in the code. I would like to have a tool that would help me, as a designer, get better: avoid mistakes, use cool programming constructs, make better design. None of the tools I know help with that.

This paper shows that my understanding is similar to the developers studied in the paper. Documentation – automatically fixing and suggesting were top priority. Renaming suggestions, commenting and explaining were some others.

Detection of duplicated code, architectural analysis and similar things were also mentioned as expectations. I cannot agree more! These things are priority 1 – I would also expect them to be there.

Now, some are more difficult that others – like analyzing the architecture. Not a trivial task at all, cause what is the architecture? Where are the patterns? How to find it from the code? How to rely on the tools that research provides? W’re not there yet.

Duplicate code, however, is something we should be able to fix. I’ve looked at some repository that had over 200 papers about code clones, duplicates and what have you. Are all these papers good? Probably not, but even if 10% is good, then here we have 20 tools we can try.

I agree, we do have SonarQube and similar tools, but they are not integrated with code review. I cannot just link to a report from SQ when writing a review comment. I cannot add a review comment to a detected technical debt in SQ. So, no integration then?

Maybe it’s just a friday afternoon thing, but I hope that we can get better in making the last mile with our tools. Hope that we can address the expectations that the developers have …