How to improve code reviews

https://dl.acm.org/doi/pdf/10.1145/3660806

I have written a lot about code reviews. Not because this is my favourite activity, but because I think there is room for improvement there. Reading others’ code is not as fun as we may think, so making it a bit more engaging is highly desirable.

This paper caught my attention because of its practical focus. In particular, the abstract claims that changing the order of the files presented for review makes a big difference: up to 23% more comments are written when the files are arranged in the right order. Not only that, the quality of the comments seems to increase too. More tips:

  1. Re-order Files Based on Hot-Spot Prediction: The study found that reordering the files changed by a patch to prioritize hot-spots—files that are likely to require comments or revisions—improves the quality of code reviews. Implementing a system that automatically reorders files based on predicted hot-spots could make the review process more efficient, as it leads to more targeted comments and a better focus on critical areas.
  2. Focus on Size-Based Features: The study highlighted that size-based features (like the number of lines added or removed) are the most important when predicting review activities. Emphasizing these features when prioritizing files or creating models for review could further streamline the process.
  3. Utilize Large Language Models (LLMs): LLMs, such as those used for encoding text, have shown potential in capturing the essence of code changes more effectively than simpler models like Bag-of-Words. Incorporating LLMs into the review tools could improve the detection of complex or nuanced issues in the code.
  4. Automate Hot-Spot Detection and Highlighting: The positive impact of automatically identifying and prioritizing hot-spots suggests that integrating such automation into existing code review tools could significantly enhance the efficiency and effectiveness of the review process.

This sounds like one of those examples where we can see the large benefits of using LLMs in code reviews. I hope that this will make it into more companies than Ubisoft (a partner on that paper).

I asked ChatGPT to provide me with an example of how to create such a hot-spot model, and it seems that this can be implemented in practice quite easily. I will not paste its answer here, but please try for yourself.
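
To give a flavour of the idea, here is a minimal sketch of a hot-spot predictor built on size-based features, which the study found to be the most predictive. The column names, the "review_history.csv" file and the choice of classifier are my own assumptions for illustration; this is not the model from the paper.

```python
# A toy hot-spot predictor: rank the files of a patch by the predicted
# probability that they will attract review comments.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical historical data: one row per (patch, file) with
# size-based features and a label saying whether the file got comments.
history = pd.read_csv("review_history.csv")
features = ["lines_added", "lines_removed", "num_hunks"]

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(history[features], history["got_comments"])

def reorder_files(patch_files: pd.DataFrame) -> pd.DataFrame:
    """Return the files of a new patch, hottest (most comment-prone) first."""
    scores = model.predict_proba(patch_files[features])[:, 1]
    return (patch_files
            .assign(hotspot_score=scores)
            .sort_values("hotspot_score", ascending=False))
```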

Commander programmer and his army of bots… – Software Engineering in a post-GenAI era

Software engineering is becoming a profession where AI can make the biggest difference. GitHub and the availability of open source code, as well as natural language texts, have provided the best possible fuel for creating large and capable models. Some predict that we will not need programmers any more, but I predict that we will, and that these programmers will be much better than they are today. I view the future of our profession as quite bright.

To begin with, I see the profession changing so that a single programmer becomes more like a team leader for a number of LLMs or bots – hence the title. The programmer will use one model/bot for extracting requirements from standards, documents, tweets or other text sources. This requirements bot will help the programmer find the user stories/requirements, prioritize them and set up a backlog.

Then, the programmer will use another bot to create a high-level design. The bot will propose an initial design and offer the programmer some sort of conversational interface to reason about it – the patterns, the architectural styles and the typical non-functional characteristics to maintain for the software.

At the same time, the programmer will use either the same bot or another one to write the source code of the software. We already use tools like GitHub Copilot for these tasks, and these tools will get even better. When constructing the software, the programmer will also use testing bots to create test cases and to improve the programs. Here, the work on program repair is particularly interesting, as it provides the ability to automatically improve the low-level design of the software.

Finally, once the software is complete and released, the programmer will use bots to monitor it. Finding defects fast and monitoring the performance, availability and security of the software will be delegated to even more bots. They will do this work faster than any programmer anyway.

The job of the programmer will then be to instruct, integrate and monitor these bots – a little bit like a team leader who needs to make sure that all team members agree on certain principles. This means that programmers will need to know more about the domains where they work, and they will need access to tooling that supports their workflows.

What will also change is our perception of software quality. If these bots make software construction much faster, then we probably will not need to write super-maintainable code – the bots will be able to decipher even the most obscure code and help us improve it. Hey, maybe these bots will even write a completely new version of our software once they learn how to optimize their design and once we learn how to link them to one another. Regardless, I do not think they will be able to write Skynet-style code…

On Security Weaknesses and Vulnerabilities in Deep Learning Systems

Image: wordcloud based on the text of the article

In the rapidly evolving field of artificial intelligence (AI), deep learning (DL) has become a cornerstone, driving advancements based on transformers and diffusers. However, the security of AI-enabled systems, particularly those utilizing deep learning techniques, is still in question.

The authors conducted an extensive study, analyzing 3,049 vulnerabilities from the Common Vulnerabilities and Exposures (CVE) database and other sources. They employed a two-stream data analysis framework to identify patterns and understand the nature of these vulnerabilities. Their findings reveal that the decentralized and fragmented nature of DL frameworks contributes significantly to the security challenges.

The empirical study uncovered several patterns in DL vulnerabilities. Many issues stem from improper input validation, insecure dependencies, and inadequate security configurations. Additionally, the complexity of DL models makes it harder to apply conventional security measures effectively. The decentralized development environment further exacerbates these issues, as it leads to inconsistent security practices and fragmented responsibility.

It does make sense, then, to put a bit of effort into securing such systems. At the end of the day, input validation is no rocket science.
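
To illustrate that point, here is a minimal sketch of input validation placed in front of a DL model. The expected shape, pixel range and checks are my own assumptions for a toy image classifier, not something taken from the paper.

```python
# Reject malformed inputs before they ever reach the model.
import numpy as np

EXPECTED_SHAPE = (224, 224, 3)   # assumed model input size
PIXEL_RANGE = (0.0, 1.0)         # assumed normalized pixel values

def validate_input(x: np.ndarray) -> np.ndarray:
    """Basic sanity checks on a single input sample."""
    if not isinstance(x, np.ndarray):
        raise TypeError("expected a numpy array")
    if x.shape != EXPECTED_SHAPE:
        raise ValueError(f"expected shape {EXPECTED_SHAPE}, got {x.shape}")
    if not np.isfinite(x).all():
        raise ValueError("input contains NaN or Inf values")
    lo, hi = PIXEL_RANGE
    if x.min() < lo or x.max() > hi:
        raise ValueError(f"pixel values must lie in [{lo}, {hi}]")
    return x.astype(np.float32)
```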

Federated learning in code summarization…

3661167.3661210 (acm.org)

So far, we have explored two different kinds of code summarization – either using a pre-trained model or training our own. However, both of them have severe limitations. The pre-trained models are often good, but too generic for the project at hand. The private models are good, but often require a lot of high-quality data and processing power. In this article, the authors propose a third way – federated learning.

The results show that:

  • Fine-tuning LLMs with few parameters significantly improved code summarization capabilities. LoRA fine-tuning on 0.062% of parameters showed substantial performance gains in metrics like C-BLEU, METEOR, and ROUGE-L.
  • The federated model matched the performance of the centrally trained model within two federated rounds, indicating the viability of the federated approach for code summarization tasks (a rough sketch of the aggregation step follows after this list).
  • The federated model achieved optimal performance at round 7, demonstrating that federated learning can be an effective method for training LLMs.
  • Federated fine-tuning on modest hardware (40GB GPU RAM) was feasible and efficient, with manageable run-times and memory consumption.
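
Here is a minimal sketch of what the federated-averaging step for LoRA adapters could look like, under my own assumptions about the setup; it is not the paper's code. Each client fine-tunes only the small LoRA matrices locally, sends them to the server, and the server averages them and broadcasts the result back.

```python
# Weighted FedAvg over LoRA adapter state dicts, one dict per client.
from typing import Dict, List
import torch

def fedavg_lora(client_adapters: List[Dict[str, torch.Tensor]],
                client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Average LoRA adapter tensors, weighting clients by dataset size."""
    total = float(sum(client_sizes))
    averaged = {}
    for name in client_adapters[0]:
        averaged[name] = sum(
            (size / total) * adapters[name]
            for adapters, size in zip(client_adapters, client_sizes)
        )
    return averaged

# Each federated round: broadcast the averaged adapters, let the clients
# fine-tune locally for a few steps, collect their updates, and repeat.
```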

I need to take a look at this model a bit more since I like this idea. Maybe this is the beginning of the personalized bot-team that I always dreamt of?

Human-centric AI (article review)

Image by PublicDomainPictures from Pixabay

https://dl.acm.org/doi/pdf/10.1145/3664805

In artificial intelligence (AI), the conversation is shifting from mere technological advancements to the implications these innovations have for society. The paper “Human-Centric Artificial Intelligence: From Principles to Practice” focuses on the concept of designing AI systems that prioritize human values and societal well-being. It’s not my usual reading, but it caught my attention because the title is close to one of the programs at our faculty.

Key Principles of Human-Centric AI

The paper outlines several core principles necessary for the development of human-centric AI:

  1. Transparency: AI systems must be transparent, providing clear insights into how decisions are made.
  2. Fairness: Ensuring that AI systems operate without bias and are equitable in their decision-making processes.
  3. Accountability: Developers and organizations must be accountable for the AI systems they create. This involves implementing mechanisms to monitor AI behavior and mitigate harm.
  4. Privacy: Protecting user data is paramount. AI systems should be designed to safeguard personal information and respect user privacy.
  5. Robustness: AI systems must be reliable and secure, capable of performing consistently under varying conditions and resilient to potential attacks.

It seems to me that the journey towards human-centric AI has barely begun; we have not achieved these goals yet. Balancing innovation with ethical considerations can be difficult, especially in a fast-paced technological landscape.

As we continue to integrate AI into more products, services and thus various aspects of society, the emphasis on human-centric principles will be crucial in ensuring that these technologies benefit humanity as a whole. We need to keep an eye on these developments.

Mitigating the impact of mislabeled data on deep predictive models: an empirical study of learning with noise approaches in software engineering tasks

Image by Michal Jarmoluk from Pixabay

Mitigating the impact of mislabeled data on deep predictive models: an empirical study of learning with noise approaches in software engineering tasks | Automated Software Engineering (springer.com)

Labelling data and annotating images or text is really tedious work. I don’t do it a lot, but when I do, it takes time.

This paper presents a study of the extent to which mislabeled samples poison SE datasets and what it means for deep predictive models. The study also evaluates the effectiveness of current learning with noise (LwN) approaches, initially designed for AI datasets, in the context of software engineering.

The core of their investigation revolves around two primary datasets representative of the SE landscape: Bug Report Classification (BRC) and Software Defect Prediction (SDP). Mislabeled samples are not just present; they significantly alter the dataset, affecting everything from the class distribution to the overall data quality.
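
As a side note, one popular family of learning-with-noise ideas is quite simple to sketch: the "small-loss" heuristic, where samples that keep an unusually high loss early in training are treated as likely mislabeled. The snippet below is my own illustration of that idea, not one of the approaches evaluated in the paper.

```python
# Keep only the fraction of samples with the smallest per-sample loss.
import numpy as np

def select_clean_samples(per_sample_losses: np.ndarray,
                         keep_ratio: float = 0.8) -> np.ndarray:
    """Return indices of the `keep_ratio` fraction with the smallest loss."""
    n_keep = int(len(per_sample_losses) * keep_ratio)
    return np.argsort(per_sample_losses)[:n_keep]

# Usage: compute per-sample losses after a warm-up epoch, then continue
# training only on the selected "clean" subset.
losses = np.array([0.12, 2.31, 0.08, 0.95, 3.40])
clean_idx = select_clean_samples(losses, keep_ratio=0.6)  # -> [2, 0, 3]
```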

The implications of this study are interesting for developers and researchers as they offer a roadmap for navigating the challenges of data quality and model integrity in software engineering, ensuring that as we advance, our tools and models do so on a foundation of accurate and reliable data.

Sketches to models…

Image by 127071 from Pixabay

https://www.computer.org/csdl/proceedings-article/models/2023/248000a173/1SOLExN0XaU

It’s been a while since I worked with models, so I looked a bit at how things have evolved. As I remember it, one of the major problems with modelling was one of its broken promises – simplicity.

The whole idea of modelling was to be able to sketch things, discuss candidate solutions and then transfer them into a proper model. However, in practice, it never worked like that – the sheer process of transferring a solution from the whiteboard to a computer took time. Maybe even so much time that the informal sketches were not really worth the effort.

Now we have CNNs and all kinds of ML algorithms, so why not use them? This paper studies exactly that.

The paper “SkeMo: Sketch Modeling for Real-Time Model Component Generation” by Alisha Sharma Chapai and Eric J. Rapos presents an approach for automated, real-time generation of model components from sketches. The approach is based on a convolutional neural network that classifies sketches into model components, and it is integrated into a web-based model editor that supports a touch interface. At the moment, the tool supports classes with their properties (methods and attributes) and the relationships between them; the prototype also allows updating models via non-sketch interactions.

The tool, SkeMo, has been validated both by calculating the accuracy of the classifier (the convolutional neural network) and through a user study with human participants. During the evaluation, the classifier performed with an average precision of over 97%. The user study indicated an average accuracy of 94%, with six participants reaching 100%. This study shows how we can successfully employ machine learning in the modeling process to make it more natural and agile for the users.
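
For a flavour of what the classification part could look like, here is a toy CNN for classifying sketch crops into a few model-component types. It is my own minimal illustration under assumed input sizes and classes, not SkeMo’s actual architecture.

```python
# A small CNN that maps 64x64 grayscale sketch crops to component classes.
import torch
import torch.nn as nn

class SketchClassifier(nn.Module):
    def __init__(self, num_classes: int = 3):  # e.g. class, association, generalization
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: sketch crops of shape (batch, 1, 64, 64)
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

model = SketchClassifier()
logits = model(torch.randn(8, 1, 64, 64))   # -> shape (8, 3)
```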

Generating documentation from notebooks

https://github.com/jyothivedurada/jyothivedurada.github.io/blob/main/papers/Cell2Doc.pdf

Understanding code is the same regardless of whether it is in a Jupyter notebook or in another editor. Comments and documentation are key. I try to teach that to my students and some of them, at least, appreciate it. Here is a paper that can change this for the better without adding more effort.

This paper introduces a machine learning pipeline that automatically generates documentation for Python code cells in data science notebooks. Here’s a more casual summary of what they did and found:

  1. The Solution – Cell2Doc: The team whipped up a new tool called Cell2Doc. It’s a smart pipeline that breaks down code cells into logical parts and documents each bit separately. This way, it gets more details and can explain complex code better than other tools.
  2. How It Works: Cell2Doc has two main parts. First, a Code Segmentation Model (CoSEG) chops up the code into chunks that make sense on their own. Then, a Code Documentation Model (CoDoc) writes up explanations for each chunk. In the end, you get a full set of docs that covers everything the code is doing (a rough sketch of this kind of pipeline follows after the list).
  3. The Cool Part: This isn’t just about slapping together existing models. Cell2Doc actually makes them better at writing docs for code. It’s like giving a turbo boost to the models so they can catch more details and write clearer explanations.
  4. Testing It Out: They didn’t just build this and hope for the best. They tested it with real data from Kaggle, a place where data scientists hang out and compete. They even made a new dataset for this kind of task because the old ones weren’t cutting it.
  5. The Results: When they put Cell2Doc to the test, it did a bang-up job. It scored way higher on automated tests than other methods, and real humans liked it better too. It was better at being correct, informative, and easy to read.
  6. Sharing Is Caring: They’re not keeping this to themselves. They’ve shared Cell2Doc so anyone can use it to make their code easier to understand.
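
Here is a rough sketch of a segment-then-document pipeline in the spirit of Cell2Doc. The naive blank-line segmentation stands in for CoSEG and an off-the-shelf code-summarization checkpoint stands in for CoDoc; both are my assumptions, not the paper’s components.

```python
# Split a notebook cell into segments and prepend a generated comment to each.
from transformers import pipeline

summarizer = pipeline("summarization",
                      model="Salesforce/codet5-base-multi-sum")

def split_cell(code: str) -> list[str]:
    """Very naive segmentation: split a notebook cell at blank lines."""
    return [seg for seg in code.split("\n\n") if seg.strip()]

def document_cell(code: str) -> str:
    """Return the cell with a one-line generated comment above each segment."""
    documented = []
    for segment in split_cell(code):
        summary = summarizer(segment, max_length=30)[0]["summary_text"]
        documented.append(f"# {summary}\n{segment}")
    return "\n\n".join(documented)
```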

In a nutshell, Cell2Doc is like a super-smart assistant that takes the headache out of writing docs for your code. It understands the code deeply and explains it in a way that’s easy to get, which is pretty awesome for keeping things clear and making sure your work can be used by others.

I am considering giving this tool to my students in the spring when they learn how to program embedded systems in C.

Robustness in language interpretation using LLMs

https://arxiv.org/pdf/2309.10644.pdf

I’ve used language models for a while now. They are capable of many tasks, but one of their main problems is the robustness of their results. The models can produce very different outputs if we change only a minor detail.

This paper addresses the challenge of interpretability in deep learning models used for source code classification tasks such as functionality classification, authorship attribution, and vulnerability detection. The authors propose a novel method called Robin, which aims to create robust interpreters for deep learning-based code classifiers.

Key points from the paper include:

  1. Problem with Current Interpretability: The authors note that existing methods for interpreting deep learning models are not robust and struggle with out-of-distribution examples. This is a significant issue because practitioners need to trust the model’s predictions, especially in high-security scenarios.
  2. Robin’s Approach: Robin introduces a hybrid structure that combines an interpreter with two approximators. This structure leverages adversarial training and data augmentation to improve the robustness and fidelity of interpretations.
  3. Experimental Results: The paper reports that Robin achieves on average a 6.11% higher fidelity when evaluated on the classifier, 67.22% higher fidelity when evaluated on the approximator, and 15.87 times higher robustness compared to existing interpreters. Additionally, Robin is less affected by out-of-distribution examples.
  4. Contributions: The paper’s contributions are threefold: addressing the out-of-distribution problem, improving interpretation robustness, and empirically evaluating Robin’s effectiveness compared to known post-hoc methods.
  5. Motivating Instance: The authors provide a specific instance of code classification to illustrate the problem inherent to the local interpretation approach, demonstrating the need for a robust interpreter like Robin.
  6. Design of Robin: The paper details the design of Robin, which includes generating perturbed examples, leveraging adversarial training, and using mixup to augment the training set (a minimal sketch of mixup follows after this list).
  7. Source Code Availability: The source code for Robin has been made publicly available, which can facilitate further research and application by other practitioners.
  8. Paper Organization: The paper is structured to present a motivating instance, describe the design of Robin, present experiments and results, discuss limitations, review related work, and conclude the study.

The authors conclude that Robin is a significant step forward in producing interpretable and robust deep learning models for code classification, which is crucial for their adoption in real-world applications, particularly those requiring high security.

Differential prompting for test case generation

https://arxiv.org/pdf/2304.11686.pdf

Generating test cases is one of the new areas where ChatGPT is gaining traction. This is a good thing, as it allows software developers to quickly raise the quality of their software.

This paper discusses the problem and challenges in finding failure-inducing test cases, the potential of using LLMs for software engineering tasks, and the limitations of ChatGPT in this context. It also provides insights into how the task of finding a failure-inducing test case can be facilitated if the program’s intention is known, and how ChatGPT’s weakness at recognizing nuances can be leveraged to infer a program’s intention.

The authors propose Differential Prompting as a new paradigm for finding failure-inducing test cases, which involves program intention inference, program generation, and differential testing. The evaluation of this technique on QuixBugs and Codeforces demonstrates its effectiveness, notably outperforming state-of-the-art baselines.
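
To make the differential-testing step concrete, here is a minimal sketch of how one could compare a program under test against a set of LLM-generated reference versions and flag inputs on which they disagree. The function names are hypothetical and this is not the authors’ implementation.

```python
# Flag candidate inputs where the program under test disagrees with the
# majority output of the reference versions.
from collections import Counter
from typing import Any, Callable, Iterable, List

def find_failure_inducing(program_under_test: Callable[[Any], Any],
                          reference_versions: List[Callable[[Any], Any]],
                          candidate_inputs: Iterable[Any]) -> List[Any]:
    failures = []
    for inp in candidate_inputs:
        # The majority output of the references approximates the intended behaviour.
        outputs = [repr(ref(inp)) for ref in reference_versions]
        majority, _ = Counter(outputs).most_common(1)[0]
        if repr(program_under_test(inp)) != majority:
            failures.append(inp)
    return failures
```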

The contributions of the paper include the original study of ChatGPT’s effectiveness in finding failure-inducing test cases, the proposal of the Differential Prompting technique, and the evaluation of this technique on standard benchmarks.

The paper also acknowledges that Differential Prompting works best for simple programs and discusses its potential benefits in software engineering education. Preliminaries and methodology are provided to illustrate the task of finding failure-inducing test cases and the workflow of Differential Prompting.

The authors conclude with the promising application scenarios of Differential Prompting, suggesting that while it is currently best for simple programs, it is a step towards finding failure-inducing test cases for larger software. They also highlight its benefits for software engineering education.