One of the most prominent problems with using research results in practice is the lack of replication packages, but this is far from being the only one. Another one, maybe an equally important problem, is the fact that the studies report performance in many different ways.
The paper that I wish to bring up is trying to address a similar aspect of software engineering. The paper reviews existing studies that provide recommendations, e.g., to report confusion matrices or to report statistical significance tests. Then it reviews some of the papers published in respected venues and then it provides actionable guidelines on how to evaluate the performance of machine learning models.
Language models are powerful tools if you know how to use them. One of the areas where they can be used in recognizing security vulnerabilities. In this article, the authors look into six language models and test them.
The results show that there are more challenges than solutions in this area. The models can be applied to languages, but the problem is with the examples and the ground truth. What is good about the paper is that it provides a good overview of the models and how they are used. They also look a bit deeper on why the limitations of the models happen.
It’s something that our team has also observed in other context, but I will talk about that in some other event. Stay tuned.
As you have probably observed I’ve been into language models for code analysis, design and recognition. It’s a great way of spending your research time as it gives you the possibility to understand how we program and understand how to model that. In my personal case, this is a great complement to the empirical software engineering research that I do otherwise.
In the recent time I got a feeling that I look into more and more of these models, all of them baring certain similarity to the Google’s BERT model or the Fracebook’s TransCoder. So I set off to do a short review of the papers that actually talk about code models or, as they are often called, programming language models. I started from the paper describing CodeBERT ( [2002.08155] CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arxiv.org) ) and looked at the 500 citations that the model has. The list below is just the list of the models that are created based on CodeBERT. There are also models created based on AlphaGo or Github CoPilot, but I leave these for another occasion.
I must admit that I did not read all of these papers and did not look at all of these models. Far from it, I only looked at some of them. My conclusion is that we have a lot of models, but the quality of the results vary a lot. The best models provide good results in ca. 20% of cases. AlphaCode is an example of such a model, which is fantastic, but not super-accurate all the time. As the model is used for super-competitive tasks, 20% is actually very impressive – it’s difficult to say that I would do better for these programming competitions, so I’m not criticizing.
The best model I’ve seen so far, however, is the Github CoPilot, which is by far the best model to create code that the world has seen. Well, there may be models that the world has not seen, but then they do not count. If you would like to see a preview of how I use it (part I), you can take a look at this video:
I sincerely hope that you find this list useful and that you can help me to keep it updated – drop me an e-mail about the list if you want to:
After my last post, and the visit to the workshop at MDU, I realized that there are a few tools that can be used automatically already now. So, this paper presents one of them.
What is interesting about this tool is that it uses github workflows, so it’s compatible with many modern CI/CD pipelines. The tool analyzes worflows and looks for security vulnerabilities. For example, if you keep sensitive information in a plain text file that is used in the workflow (secrets), or checks if the workflow enforces the “least privilege” principle.
So I find myself on the train again, this time strolling towards MDU for their cybersecurity workshop. Not that I am an expert on just cybersecurity, but I know a bit about programming and design. I also know this much to see that a secure product needs to start designing for security, not only testing for it.
I stumbled upon this paper about a week ago, probably as it has been submitted to some conference and the pre-print became available. It is a paper that interviews 10 developers and surveys over 180 professionals about how they work with finding security vulnerabilities during code reviews. I will not describe the entire article, although I wish I had the time to do that. Here are some of the highlights.
“Interviewees stated to disregard security aspects during code reviews due to their assumptions about the security dynamic of the application they develop. ” – this is an interesting finding, as many companies see the code reviews as a golden bullet of software quality assurance today. Yet, the developers do not review something they thing “someone else” does…
When it comes to the survey, the results show that the majority of software developers think about security during their code reviews. The majority of the developers admit that there is no security experts reviewing their code, which is probably not great. Maybe we should have some of the security experts do some code reviews? Maybe both the developers and the security specialists would learn something from one another?
Finally, I think that the survey puts a finger on one of the pain points in modern companies – support for specific aspects of code reviews. They would like to see more support for the developers for making better security evaluations. I could only speculate that this is about in-depth training.
Well, very interesting reading. Let me get back to the paper, looking at the beautiful landscapes of Östergötland….
Code reviews are time consuming. And effort intensive. And boring. And needed. Depending whom we ask, we get one of the above answers (well, 80% of the time). The reality is that the code reviews are not the most productive activity. Reading the code and looking for defects is good when we do it once, but when we need to work with it during continuous integration, the story changes. It becomes like studying for the exam or the homework – we do everything else to postpone it. Then someone waits longer or the code quality suffers.
There has been a lot of work done to make this activity more fun – gamification, automated support, using machine learning to filter out the code that we can automatically check – just to name the few. As far as I know, there has not been much work in understanding of what kind of problems code reviews really find.
In this article, the authors address that very question. Admittedly, they only analyzed 7 OSS projects, but their results are still interesting: “We identified 116 defect types that we grouped into 15 groups to create a defect classification. Additionally, 38% of these defects could be automatically detected accurately. “
So, what the code reviews are good for? Here is their list:
errors, warnings and logging,
logic and functionality
The list is sorted from the least frequent to the most frequent – so logic and functionality is the category where the code reviews are the most useful for. I need to also say that the frequencies are not super-high – threading is only 1 detected concern, while logic and functionality has 57. So, you know, could be more, given how much time is spent on code reviews. I guess it is what the quality costs nowadays, even though there is no real data on this.
To be honest, I did not expect machine learning to be part of a compiler… I’ve done programming since I was 13, understood compilers during my second year at the university and even wrote one (well, without any ML, that is).
Why would a compiler need machine learning, I wondered. It’s a pretty simple program – it takes a grammar, then parses the source code and translates that to a machine code (or some other low level representation). It has to be deterministic as the same program cannot compile to two different machine codes. It’s just the way it is….
It turns out that machine learning is used in modern compilers to perform optimizations. The optimizations are done to take advantage of modern processors, their registers and long instructions sets. These optimizations are meant to support machine code in being more parallel, allowing the modern multi-core, multi-thread processors to utilize every little bit of energy in all their cores.
In this paper, the authors use language models like BERT to create a benchmark that will allow different optimization techniques to be compared. This means, that the same compiler, can test itself against these benchmarks in order to find the best possible solution. Clever!
However, this is it from me. I’m planning on writing a compiler, let alone an optimizer. I may use BERT models in the future for generation of programs, but I will most probably end there. But, in case you wonder – there is ML in compilers 🙂
Testing of neural networks is still an open problem. Due to the complexity of their connections, and their probabilistic nature, it is difficult to find defects. Although there is a lot of approaches, e.g., using autoencoders or using surprise adequacy measures, testing of neural networks is something of a mistery for me.
I could say that the topic was under my radar for a while. I actually though that there is not much need for testing research in software engineering; even though I run two projects with the testing components. For one, I thought that deep learning is basically like a “rabbit hole” – the more you test it, the more interesting properties you discover. I’ve tried to use testing to understand what kind of things the models learn, but I’m not sure that this is the right approach. I’m affraid that this will never be the case – the deep learning models learn something, we can evaluate it, but we can never really fully understand what the models has learned.
Now, this article uses mutation testing for the purpose to find the best test suite to validate the models. Well, it does more than that. It offers a framework where we can use three different models to evaluate the mutants and choose the ones that are expected to provide the best results. It is built on top of frameworks/models like DeepCrime (link here) and can provide a better selection approach. So far, the framework has been evaluated on the standard dataset – MNIST – but I hope that it will be expanded on other datasets in the future.
The summer is in full swing, and after a few weeks of leisure and relaxation, I’m back to work. In one of our research projects, we examine the ability to test deep learning systems for computer vision in autonomous drive systems. It’s been a challenge, as the field is rather scattered. There is a lot of work on testing DL systems but without the specifics of the safety or autonomous drive. At the same time, there are a lot of studies about testing autonomous systems – usually using simulations.
So, in this paper, the authors focus on using metamorphic testing to test DL networks. By manipulating the input images, they observe how the network reacts and what the predicted behavior is. This helps to establish some sort of boundaries regarding when the system is safe to operate and how it can behave in practice. It allows an understanding of which neurons were essentially activated in the network (which is not the same as network coverage).
The paper presents a tool for that purpose, which is something that I really need to try on our autoencoders from the DeVELOP project.
Articla available at: https://arxiv.org/pdf/2205.11739.pdf
It’s no secret that I’ve been fascinated by modern, BERT-like language models. I’ve seen what they can do and how they operate, use them in two of my research projects. So, when this paper came around, I read it directly.
It’s a paper which makes an overview of what kind of tasks the language models are used in software engineering today. The list is long and contains a variety of tasks, e.g., code-to-code retrieval, repairing of source code or bug finding/fixing. In total a lot of these tasks, but, IMHO, a bit low-level tasks. There are no tasks that attempt to understand code at the design-level, for example whether we can really see specific design in the code.
The paper also shows which models are used, and provides references to these models. They list 20 models, with the tasks for which they were trained, including the datasets that they were trained on. Fantastic!
I need to dive deeper into these models, but I’m super happy about the fact that there is a list of these models now and that the language technology makes a significant body of work in software engineering now.