Can You Trust GPT with Your System Design? Testing AI’s Architectural IQ

Image by Vinson Tan ( 楊 祖 武 ) from Pixabay

https://ieeexplore.ieee.org/document/10978937

We’ve all seen Large Language Models (LLMs) write impressive snippets of code or debug a tricky function. But can an AI actually understand the soul of a system? Can it explain the “why” behind a complex architectural decision?

The paper, “Do Large Language Models Contain Software Architectural Knowledge? An Exploratory Case Study with GPT,” puts this to the test. Researchers conducted a study with 14 software engineers to see if GPT could navigate the Architectural Knowledge (AK) of a massive, real-world system: the Hadoop Distributed File System (HDFS).

The Experiment: AI vs. The Ground Truth
Engineers grilled GPT with questions ranging from basic component identification to deep design rationales. GPT’s answers were then compared against a verified “ground truth” drawn from HDFS documentation.

The Results
The study revealed a fascinating dichotomy in GPT’s performance:

  • Recall was moderate: GPT is surprisingly good at “remembering” things. It could often identify the correct architectural components and general concepts buried in its training data.
  • Precision was poor: GPT struggled with accuracy. It frequently provided answers that sounded authoritative but were technically incorrect or “hallucinated.”

When asked about design rationales (why a specific solution was chosen) or quality attribute solutions, GPT’s performance dipped significantly. It can tell you what is there, but it struggles to explain the engineering trade-offs.
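To make the recall/precision distinction concrete, here is a minimal sketch of the kind of scoring one could apply to such answers. The component names and the “GPT answer” below are hypothetical illustrations, not data from the paper:

```python
def precision_recall(predicted, ground_truth):
    """Score a set of predicted items against a verified ground truth."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = predicted & ground_truth
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Hypothetical example: the model names four HDFS components, two of
# which are confirmed by the documentation; the other two are invented.
truth = {"NameNode", "DataNode", "JournalNode", "ZKFC"}
answer = {"NameNode", "DataNode", "BlockManagerX", "EditLogServerX"}
p, r = precision_recall(answer, truth)
print(f"precision={p}, recall={r}")  # precision=0.5, recall=0.5
```

The asymmetry the study reports corresponds to cases where the recall term stays moderate while hallucinated items drag the precision term down.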

The Takeaway for Architects
The engineers in the study rated GPT’s trustworthiness as only moderate. The verdict is clear: GPT is a fantastic tool for initial discovery and brainstorming, but it cannot be used as a source of truth for critical system design.

The bottom line: treat LLMs as junior architects with a photographic memory but a shaky grasp of logic. They are great for a first draft, but expert human validation remains the most important step in the process.

Never eat alone, or else….

Image by Silviu on the street from Pixabay

Never Eat Alone: Ferrazzi, Keith, Raz, Tahl: 9780241004951: Amazon.com: Books

In academia, the motto is “publish or perish”, with the emphasis on publishing. It is for a good reason – we academics, scholars and researchers exist in a complex network of dependencies. We need others for inspiration, for understanding, and for help when we get stuck.

If you look at Nobel Prize winners, most of them work in collaboration. Listening to them, I get the impression that you cannot become great by sitting alone in your room and hatching ideas. But, at the same time, many of us are introverts – at least I am.

This book is a great example of how we can build our networks and make meaningful connections. It helped me realize how to be good at meaningful networking – not the kind where you focus on meeting as many people, or as important people, as possible. No, it is about meeting all kinds of people and learning from them. It is about identifying even a single piece of information that you can use in your own work and for your own benefit.

I recommend this as reading for one of those dark autumn evenings that are inevitably coming now….

Materials that shape the world (of computing)

Material World: A Substantial Story of Our Past and Future: Ed Conway: 9780753559178: Amazon.com: Books

As a software engineer, I take hardware for granted. Moore’s law has taught me that computing power keeps growing. My experience has taught me that this power is then consumed by frameworks and clouds, and eventually it is not enough.

This great book sheds a really interesting light on the way in which materials like lithium and silicon shape our society. We tend to think of TSMC as an isolated company that excels at chip-making. In reality, the company is great, but it is also only one link in a long chain of suppliers in the chip industry. We learn that the sand used to make chips comes from the US, not from Taiwan. We learn that the lithium in our batteries often comes from the Andes in Chile, not from China. We also learn that the ONLY way for humanity to progress is to collaborate across nations – no single country in the world has the machinery, the know-how and the competence to develop our modern technology alone.

It belongs in a series of great readings for software engineers starting their studies today.

How innovations spread…

The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution: Isaacson, Walter: 9781476708706: Amazon.com: Books

Although this book is a bit dated – in the sense that we now call everything pre-pandemic dated – it is a great read. It takes us on a journey through the inventions of Silicon Valley, although it starts with Ada Byron (later Lovelace) and her work on computing machines.

I recommend this book because it goes against established theories in academia about innovation – that we innovate individually or in teams. Instead, it takes us on a journey of connections, with research and innovation building on one another. It tells a great story of how the world’s technology evolves, one innovation enabling the next. It is a story of global collaborations, and of how these collaborations are entangled with and support one another.

If it were up to me, this would be mandatory reading for all new students of university software engineering programs.

How to improve the code reviews

https://dl.acm.org/doi/pdf/10.1145/3660806

I have written a lot about code reviews – not because they are my favourite activity, but because I think there is room for improvement there. Reading others’ code is not as fun as we might think, so making it a bit more interesting is highly desirable.

This paper caught my attention because of its practical focus. In particular, the abstract caught my attention – the authors claim that changing the order of files presented for review makes a big difference: up to 23% more comments are written when the files are arranged in the right order. Not only that, the quality of the comments seems to increase too. More tips:

  1. Re-order Files Based on Hot-Spot Prediction: The study found that reordering the files changed by a patch to prioritize hot-spots—files that are likely to require comments or revisions—improves the quality of code reviews. Implementing a system that automatically reorders files based on predicted hot-spots could make the review process more efficient, as it leads to more targeted comments and a better focus on critical areas.
  2. Focus on Size-Based Features: The study highlighted that size-based features (like the number of lines added or removed) are the most important when predicting review activities. Emphasizing these features when prioritizing files or creating models for review could further streamline the process.
  3. Utilize Large Language Models (LLMs): LLMs, such as those used for encoding text, have shown potential in capturing the essence of code changes more effectively than simpler models like Bag-of-Words. Incorporating LLMs into the review tools could improve the detection of complex or nuanced issues in the code.
  4. Automate Hot-Spot Detection and Highlighting: The positive impact of automatically identifying and prioritizing hot-spots suggests that integrating such automation into existing code review tools could significantly enhance the efficiency and effectiveness of the review process.
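As a toy illustration of tips 1 and 2 combined, here is a minimal sketch of reordering a patch’s files by a size-based hot-spot score. The weights, field names, and patch data are made-up assumptions; the paper’s actual models are far more sophisticated:

```python
def hotspot_score(file_change):
    """Toy size-based hot-spot score: larger diffs are assumed more
    likely to attract review comments. The feature choice follows the
    paper's finding that size-based features dominate; the weights
    here are invented for illustration."""
    return 2.0 * file_change["lines_added"] + 1.0 * file_change["lines_removed"]

def reorder_for_review(patch):
    """Present the likeliest hot-spots first in the review interface."""
    return sorted(patch, key=hotspot_score, reverse=True)

# Hypothetical patch touching three files.
patch = [
    {"path": "README.md", "lines_added": 2, "lines_removed": 1},
    {"path": "core/scheduler.py", "lines_added": 120, "lines_removed": 45},
    {"path": "utils/strings.py", "lines_added": 10, "lines_removed": 3},
]
for f in reorder_for_review(patch):
    print(f["path"])  # scheduler first, README last
```

A real deployment would replace `hotspot_score` with a trained classifier, but even this crude ranking shows how reordering surfaces the files most worth a reviewer’s early attention.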

It sounds like this is one of those examples where we can see large benefits from using LLMs in code reviews. I hope this will make it into more companies than Ubisoft (a partner on that paper).

I asked ChatGPT to provide me with an example of how to create such a hot-spot model, and it seems this can be implemented in practice quite easily. I will not paste it here, but please try it for yourself.

On Security Weaknesses and Vulnerabilities in Deep Learning Systems

Image: wordcloud based on the text of the article

In the rapidly evolving field of artificial intelligence (AI), deep learning (DL) has become a cornerstone, driving advancements based on transformers and diffusion models. However, the security of AI-enabled systems, particularly those using deep learning techniques, is still in question.

The authors conducted an extensive study, analyzing 3,049 vulnerabilities from the Common Vulnerabilities and Exposures (CVE) database and other sources. They employed a two-stream data analysis framework to identify patterns and understand the nature of these vulnerabilities. Their findings reveal that the decentralized and fragmented nature of DL frameworks contributes significantly to the security challenges.

The empirical study uncovered several patterns in DL vulnerabilities. Many issues stem from improper input validation, insecure dependencies, and inadequate security configurations. Additionally, the complexity of DL models makes it harder to apply conventional security measures effectively. The decentralized development environment further exacerbates these issues, as it leads to inconsistent security practices and fragmented responsibility.

It makes sense, then, to put some effort into securing such systems. At the end of the day, input validation is not rocket science.
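As a small illustration of that point, here is a hedged sketch of validating input before it reaches a deployed model. The parameter names, expected shape, and value range are illustrative assumptions, not taken from the paper:

```python
def validate_inference_input(batch, expected_len=4, lo=0.0, hi=1.0):
    """Reject malformed input before it reaches the model.
    The checks are illustrative: type, length, and value range are
    among the simplest defenses against the improper-input-validation
    issues the study highlights."""
    if not isinstance(batch, list):
        raise TypeError("batch must be a list of numeric feature values")
    if len(batch) != expected_len:
        raise ValueError(f"expected {expected_len} features, got {len(batch)}")
    for x in batch:
        if not isinstance(x, (int, float)):
            raise TypeError("feature values must be numeric")
        if not (lo <= x <= hi):
            raise ValueError(f"feature value {x} outside [{lo}, {hi}]")
    return batch

# Well-formed input passes through unchanged.
print(validate_inference_input([0.1, 0.5, 0.9, 1.0]))
```

None of this replaces dependency auditing or secure configuration, but it closes the most commonly reported class of weakness cheaply.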

Federated learning in code summarization…

3661167.3661210 (acm.org)

So far, we have explored two different kinds of code summarization – either using a pre-trained model or training our own. However, both of them have severe limitations. The pre-trained models are often good, but too generic for the project at hand. The private models are good, but often require a lot of good data and processing power. In this article, the authors propose to use a third way – federated learning.

The results show that:

  • Fine-tuning LLMs with few parameters significantly improved code summarization capabilities. LoRA fine-tuning on 0.062% of parameters showed substantial performance gains in metrics like C-BLEU, METEOR, and ROUGE-L.
  • The federated model matched the performance of the centrally trained model within two federated rounds, indicating the viability of the federated approach for code summarization tasks.
  • The federated model achieved optimal performance at round 7, demonstrating that federated learning can be an effective method for training LLMs.
  • Federated fine-tuning on modest hardware (40GB GPU RAM) was feasible and efficient, with manageable run-times and memory consumption.
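To illustrate the federated idea behind these results, here is a minimal sketch of one FedAvg-style aggregation round in plain Python. The weight vectors and client sizes are made-up toy values, not the paper’s setup:

```python
def fed_avg(client_weights, client_sizes):
    """One round of federated averaging (FedAvg): aggregate locally
    fine-tuned parameters without sharing the clients' raw code data.
    client_weights: one parameter vector per client (e.g. per project)
    client_sizes:   number of local samples each client trained on
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two projects fine-tune locally (think LoRA adapter weights), then a
# server averages the results, weighted by local dataset size.
global_w = fed_avg([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(global_w)  # [2.5, 3.5]
```

Each "round" in the paper’s results corresponds to one such aggregation followed by further local fine-tuning, which is why the federated model can approach the centrally trained one within a few rounds.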

I need to take a look at this model a bit more since I like this idea. Maybe this is the beginning of the personalized bot-team that I always dreamt of?

Human-centric AI (article review)

Image by PublicDomainPictures from Pixabay

https://dl.acm.org/doi/pdf/10.1145/3664805

In artificial intelligence (AI), the conversation is shifting from mere technological advancement to the implications these innovations have for society. The paper “Human-Centric Artificial Intelligence: From Principles to Practice” focuses on designing AI systems that prioritize human values and societal well-being. It is not my usual reading, but it caught my attention because its title is close to one of the programs at our faculty.

Key Principles of Human-Centric AI

The paper outlines several core principles necessary for the development of human-centric AI:

  1. Transparency: AI systems must be transparent, providing clear insights into how decisions are made.
  2. Fairness: Ensuring that AI systems operate without bias and are equitable in their decision-making processes.
  3. Accountability: Developers and organizations must be accountable for the AI systems they create. This involves implementing mechanisms to monitor AI behavior and mitigate harm.
  4. Privacy: Protecting user data is paramount. AI systems should be designed to safeguard personal information and respect user privacy.
  5. Robustness: AI systems must be reliable and secure, capable of performing consistently under varying conditions and resilient to potential attacks.

It seems to me that the journey towards human-centric AI has only just begun – we have not yet achieved these goals. Balancing innovation with ethical considerations can be difficult, especially in a fast-paced technological landscape.

As we continue to integrate AI into more products, services and thus various aspects of society, the emphasis on human-centric principles will be crucial in ensuring that these technologies benefit humanity as a whole. We need to keep an eye on these developments.

Volvo Cars and CoPilot

Developers are Happier and More Satisfied in Their Coding Environment (microsoft.com)

I rarely summarize other blog articles, but this one is an exception. I felt that things like this have been in the making, so it is no surprise. Well, a bit of a surprise, as it describes an experience with super-modern technology in a business where software has long taken second place.

Based on the article, six months into its rollout, developers have reported significant efficiency gains, with some tasks like unit testing seeing up to a 40% increase in productivity. Copilot’s ability to assist with testing, explaining, and generating code has allowed developers to spend more time in a “flow state,” enhancing creativity and problem-solving.

Developers at Volvo Cars are happier and find their work more enjoyable, with 75% noting increased satisfaction. The tool has also improved communication among team members, fostering better interactions and sharper problem-solving.

Anyway, this shows that companies are no longer afraid of using generative AI technologies in practice. Let’s just wait and see more of this.

Mitigating the impact of mislabeled data on deep predictive models: an empirical study of learning with noise approaches in software engineering tasks

Image by Michal Jarmoluk from Pixabay

Mitigating the impact of mislabeled data on deep predictive models: an empirical study of learning with noise approaches in software engineering tasks | Automated Software Engineering (springer.com)

Labelling data – annotating images or text – is really tedious work. I don’t do it a lot, but when I do, it takes time.

This paper presents a study of the extent to which mislabeled samples poison SE datasets and what it means for deep predictive models. The study also evaluates the effectiveness of current learning with noise (LwN) approaches, initially designed for AI datasets, in the context of software engineering.

The core of their investigation revolves around two primary datasets representative of the SE landscape: Bug Report Classification (BRC) and Software Defect Prediction (SDP). Mislabeled samples are not just present; they significantly alter the dataset, affecting everything from the class distribution to the overall data quality.
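To see how mislabeling distorts a dataset, here is a small sketch that injects synthetic label noise into BRC-style labels and shows the class distribution shifting. The flip model, rate, and class names are illustrative assumptions, not the paper’s method:

```python
import random

def inject_label_noise(labels, flip_rate, classes, seed=0):
    """Simulate mislabeling: with probability flip_rate, replace a
    label with a different class chosen uniformly at random."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < flip_rate:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy

# Imbalanced toy dataset: the majority class loses more labels to
# flips than it gains, so the class distribution itself drifts.
clean = ["bug"] * 80 + ["not-bug"] * 20
noisy = inject_label_noise(clean, flip_rate=0.2, classes=["bug", "not-bug"])
print(noisy.count("bug"), noisy.count("not-bug"))
```

Running such a simulation against a model’s training loop is a cheap way to estimate how sensitive a given predictor is before investing in an LwN approach.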

The implications of this study are interesting for developers and researchers as they offer a roadmap for navigating the challenges of data quality and model integrity in software engineering, ensuring that as we advance, our tools and models do so on a foundation of accurate and reliable data.