Language models and security vulnerabilities – what works and what does not… (article review)

Image by Jan Alexander from Pixabay

1176898.pdf (hindawi.com)

Language models are powerful tools if you know how to use them. One of the areas where they can be used is recognizing security vulnerabilities. In this article, the authors look into six language models and test them.

The results show that there are more challenges than solutions in this area. The models can be applied to different languages, but the problems lie with the examples and the ground truth. What is good about the paper is that it provides a good overview of the models and how they are used. The authors also look a bit deeper into why the models’ limitations arise.

It’s something that our team has also observed in other contexts, but I will talk about that at some other event. Stay tuned.

So, you want to automate your security assessment (beyond pentesting)…

Image by Darwin Laganzon from Pixabay

Automatic Security Assessment of GitHub Actions Workflows (arxiv.org)

After my last post, and the visit to the workshop at MDU, I realized that there are a few tools that can already be used automatically today. So, this paper presents one of them.

What is interesting about this tool is that it works on GitHub workflows, so it’s compatible with many modern CI/CD pipelines. The tool analyzes workflows and looks for security vulnerabilities. For example, it checks whether sensitive information (secrets) is kept in plain text in the workflow, or whether the workflow enforces the “least privilege” principle.
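To make the idea concrete, here is a toy sketch of the kind of check such a tool could perform (my own illustration in Python, not GHAST’s actual code): flag workflow steps whose environment variables look like secrets but hold literal values instead of references to the secrets context.

```python
# Toy illustration of a "plaintext secret" check for GitHub Actions
# workflows -- not GHAST's actual implementation.
import re
import yaml  # pip install pyyaml

SECRET_KEY = re.compile(r"(password|token|api[_-]?key|secret)", re.I)

def find_plaintext_secrets(workflow_path):
    with open(workflow_path) as f:
        workflow = yaml.safe_load(f)
    findings = []
    for job_name, job in workflow.get("jobs", {}).items():
        for step in job.get("steps", []):
            for key, value in (step.get("env") or {}).items():
                # A secret-looking env key whose value is a literal
                # string (not "${{ secrets.* }}") is a red flag.
                if SECRET_KEY.search(key) and "secrets." not in str(value):
                    findings.append((job_name, key))
    return findings

# Hypothetical workflow path, for illustration only:
print(find_plaintext_secrets(".github/workflows/ci.yml"))
```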

The implementation of the tool is OSS and can be found on GitHub here: Mobile-IoT-Security-Lab/GHAST: GitHub Actions Security Tester

I need to test it as it looks very interesting. Maybe I can use this tool on some of the company’s workflows to test their exploitability score?

Code reviews and cybersecurity… (article highlight)

https://arxiv.org/pdf/2208.04261.pdf

So I find myself on the train again, this time rolling towards MDU for their cybersecurity workshop. Not that I am an expert on cybersecurity as such, but I know a bit about programming and design. I also know enough to see that a secure product needs to start with designing for security, not only testing for it.

I stumbled upon this paper about a week ago, probably because it had been submitted to some conference and the pre-print became available. It is a paper that interviews 10 developers and surveys over 180 professionals about how they work with finding security vulnerabilities during code reviews. I will not describe the entire article, although I wish I had the time to do that. Here are some of the highlights.

“Interviewees stated to disregard security aspects during code reviews due to their assumptions about the security dynamic of the application they develop.” – this is an interesting finding, as many companies see code reviews as a silver bullet of software quality assurance today. Yet, the developers do not review something they think “someone else” does…

When it comes to the survey, the results show that the majority of software developers think about security during their code reviews. The majority of the developers also admit that there are no security experts reviewing their code, which is probably not great. Maybe we should have some of the security experts do code reviews? Maybe both the developers and the security specialists would learn something from one another?

Finally, I think that the survey puts a finger on one of the pain points in modern companies – support for specific aspects of code reviews. The respondents would like to see more support for developers in making better security evaluations. I can only speculate that this is about in-depth training.

Well, very interesting reading. Let me get back to the paper, looking at the beautiful landscapes of Östergötland…

What are code reviews really good for?

Visualization of the source code of one module from the Cloudera projects. The embeddings are taken from our team’s neural network. t-SNE is a dimensionality-reduction technique popular in bioinformatics.
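For the curious, producing such a picture takes only a few lines of Python. Here is a minimal sketch (illustrative only, with random stand-in embeddings; our actual pipeline differs):

```python
# Project high-dimensional code embeddings down to 2-D with t-SNE
# and scatter-plot them. Random data stands in for real embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.rand(200, 128)  # stand-in for neural-net embeddings
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], s=5)
plt.title("t-SNE projection of code embeddings")
plt.show()
```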

Concerns identified in code review: A fine-grained, faceted classification – ScienceDirect

Code reviews are time consuming. And effort intensive. And boring. And needed. Depending on whom we ask, we get one of the above answers (well, 80% of the time). The reality is that code reviews are not the most productive activity. Reading the code and looking for defects is fine when we do it once, but when we need to do it constantly during continuous integration, the story changes. It becomes like studying for an exam or doing homework – we do everything else to postpone it. Then someone waits longer, or the code quality suffers.

There has been a lot of work done to make this activity more fun – gamification, automated support, using machine learning to filter out the code that we can check automatically – just to name a few. As far as I know, there has not been much work on understanding what kinds of problems code reviews really find.

In this article, the authors address that very question. Admittedly, they only analyzed 7 OSS projects, but their results are still interesting: “We identified 116 defect types that we grouped into 15 groups to create a defect classification. Additionally, 38% of these defects could be automatically detected accurately.”

So, that basically means that 38% of defects could be identified by using testing or static analysis (or some other fancy automation technique). This figure summarizes their results (this is a link to the figure on ScienceDirect): https://ars.els-cdn.com/content/image/1-s2.0-S0950584922001653-gr5_lrg.jpg

So, what are code reviews good for? Here is their list:

  • threads,
  • header comments,
  • errors, warnings and logging,
  • test cases,
  • annotations,
  • performance,
  • identifier naming,
  • modifiers,
  • comments,
  • javadoc,
  • design,
  • implementation, and
  • logic and functionality

The list is sorted from the least frequent to the most frequent – so logic and functionality is the category for which code reviews are the most useful. I also need to say that the frequencies are not super high – threading has only 1 detected concern, while logic and functionality has 57. So, you know, it could be more, given how much time is spent on code reviews. I guess that is what quality costs nowadays, even though there is no real data on this.

Machine learning in compilers???

BenchPress: A Deep Active Benchmark Generator (arxiv.org)

To be honest, I did not expect machine learning to be part of a compiler… I’ve been programming since I was 13, understood compilers during my second year at university, and even wrote one (well, without any ML, that is).

Why would a compiler need machine learning, I wondered. It’s a pretty simple program – it takes a grammar, then parses the source code and translates it to machine code (or some other low-level representation). It has to be deterministic, as the same program cannot compile to two different machine codes. It’s just the way it is…

It turns out that machine learning is used in modern compilers to perform optimizations. The optimizations are done to take advantage of modern processors, their registers, and their long instruction sets. These optimizations are meant to make the machine code more parallel, allowing the modern multi-core, multi-thread processors to utilize every little bit of energy in all their cores.

In this paper, the authors use language models like BERT to create benchmarks that allow different optimization techniques to be compared. This means that the same compiler can test itself against these benchmarks in order to find the best possible solution. Clever!
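To give a flavor of how a masked language model can propose code in the first place, here is a minimal sketch (my own illustration with an assumed off-the-shelf model, not the BenchPress pipeline itself):

```python
# Minimal illustration: let a masked code LM propose completions for a
# code fragment, the way a BenchPress-style generator grows candidates.
from transformers import pipeline

# Model choice is an assumption -- any masked LM trained on code works.
fill = pipeline("fill-mask", model="huggingface/CodeBERTa-small-v1")

snippet = "for (int i = 0; i < n; i++) { sum += <mask>; }"
for candidate in fill(snippet):
    print(candidate["token_str"], round(candidate["score"], 3))
```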

However, this is it from me. I’m not planning on writing a compiler, let alone an optimizer. I may use BERT models in the future for generation of programs, but I will most probably stop there. But, in case you wonder – there is ML in compilers 🙂

Language models in Software Engineering (new paper review)

Image by Lorenzo Cafaro from Pixabay

Article available at: https://arxiv.org/pdf/2205.11739.pdf

It’s no secret that I’ve been fascinated by modern, BERT-like language models. I’ve seen what they can do and how they operate, and I use them in two of my research projects. So, when this paper came around, I read it right away.

It’s a paper that provides an overview of the kinds of tasks for which language models are used in software engineering today. The list is long and contains a variety of tasks, e.g., code-to-code retrieval, repair of source code, or bug finding/fixing. There are a lot of these tasks in total but, IMHO, rather low-level ones. There are no tasks that attempt to understand code at the design level, for example whether we can really see a specific design in the code.
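To illustrate just one task from that list, code-to-code retrieval, here is a minimal sketch (my own, with an assumed off-the-shelf encoder; the surveyed papers use their own setups):

```python
# Embed code snippets with a pre-trained code LM and rank a small
# corpus by cosine similarity to a query snippet.
import torch
from transformers import AutoModel, AutoTokenizer

name = "microsoft/codebert-base"  # assumed model choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(code):
    inputs = tok(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Mean-pool the last hidden states into one vector per snippet.
        return model(**inputs).last_hidden_state.mean(dim=1).squeeze(0)

query = "def add(a, b): return a + b"
corpus = ["def sum_two(x, y): return x + y", "print('hello world')"]
sims = [torch.cosine_similarity(embed(query), embed(c), dim=0) for c in corpus]
best = max(zip(sims, corpus), key=lambda pair: float(pair[0]))
print(best)  # the most similar snippet and its score
```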

The paper also shows which models are used and provides references to them. The authors list 20 models, with the tasks for which they were trained, including the datasets they were trained on. Fantastic!

I need to dive deeper into these models, but I’m super happy that there is now a list of them and that language technology makes up a significant body of work in software engineering.

Testing of ML systems

Image by OpenClipart-Vectors from Pixabay

Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

Machine learning systems are very popular today, at least when it comes to research applications. They are not as popular as one would wish (or like) in real applications. One of the reasons is the fact that they are hard to test. We do not know how to check whether an algorithm will behave as expected in all similar situations – well, we do not know which situations are similar for us and for the ML system.

This paper looks at the problem from a different angle. The research question is: “What are simple and generic software tests that are capable of finding bugs and improving the quality of machine learning algorithms?”

The authors developed a set of smoke tests which, they argue, all ML algorithms should pass. The paper is quite exhaustive, and if you are interested, I recommend taking a look at this table:

Table 1 | Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink
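To give a flavor of the idea, here is a minimal pytest-style sketch (these are my own illustrative tests in the spirit of the paper, not its exact suite):

```python
# Generic "smoke tests" that any sklearn-style classifier should
# survive without crashing or producing nonsense.
import numpy as np
import pytest
from sklearn.tree import DecisionTreeClassifier

def make_model():
    return DecisionTreeClassifier(random_state=0)

def test_fit_predict_on_tiny_data():
    # The algorithm should handle a minimal, degenerate dataset.
    X, y = np.array([[0.0], [1.0]]), np.array([0, 1])
    model = make_model().fit(X, y)
    assert set(model.predict(X)) <= {0, 1}

def test_constant_features():
    # All-constant features should not crash training.
    X, y = np.zeros((10, 3)), np.array([0, 1] * 5)
    make_model().fit(X, y)

def test_rejects_nan_inputs():
    # Either handle NaNs or fail loudly -- silent garbage is the bug.
    X, y = np.full((4, 2), np.nan), np.array([0, 1, 0, 1])
    with pytest.raises(ValueError):
        make_model().fit(X, y)
```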

I love the article. It is simple, to the point, and very applied. I’m going to use it in my tests of ML algorithms in the future.

How good are language models for source code tasks?

https://ieeexplore-ieee-org.ezproxy.ub.gu.se/document/9653849

The use of machine learning, and deep learning in particular, for software engineering tasks has exploded recently. I would say that it exploded a bit too much. I am partly to blame here myself, as our team was one of the early adopters with the CCFlex model and source code analysis.

Well, this paper compares a number of modern deep learning models, so-called transformers, on various code and comment analysis tasks. The authors did a great job of collecting a set of models and datasets, training the models, and critically evaluating their performance.

I recommend reading the entire paper, but what they found was a bit surprising to me. First of all, they found that the transformer models are better for natural language and not so great for source code analysis. The hypothesis is that the structure of programs is important here. They also found that pre-training is important, but not crucial: it contributes a moderate effect in the end. The dataset, and its content, is much more important for the task at hand.

This is a great paper, and I hope it can become essential reading for software engineers working with AI systems that support software engineering tasks.

Reviewing rounds prediction for code patches

Image by Pixabay

Reviewing rounds prediction for code patches | SpringerLink

Understanding how reviews of source code are done seems to be one of my main interests recently. Partly because reviews are important for software quality while also taking time. Partly because I think it is interesting to check whether we can quantify a good opinion from a good software developer.

In this paper, the authors study to what degree one can predict how many comments a given patch will receive. Now, this problem may not be the most exciting one, but it attracted my attention because the authors studied the same projects as we did. However, in contrast to our work, they also take into consideration features that characterize software developer networks – for example, experience in commenting on software patches, or networking.

Now, to the results. The models presented in this paper seem to be quite good at predicting patches within the same project – all kinds of predictions have pretty good F1-scores, above 65%. This means that we can train these models on our own projects and predict whether a particular patch will be commented on once, twice, or even many times.

The performance on the cross-project dataset is a different story. There, the performance is OK for predicting whether a particular patch will be commented on at all. Predicting how many times the patch will be commented on, or whether it will be commented on many times, does not work very well. The magnitude of the performance measures oscillates close to the 0% mark, which means that the models are no better than just guessing. I guess you cannot have it all… from one model.

To sum up, the reason I read this article in more detail than others was essentially not the performance, but trying to understand the underlying techniques they use. I’d like to say that they use a good set of features, which I recommend others to use (and will definitely use myself in the next studies), and that they use simple language models, like word2vec, to understand the programming language. What I lack, though, is scrutiny of whether there is a statistically significant dependency between the sentiment (or even its strength) and the length of the discussion.
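For my own notes, here is a rough sketch of the kind of pipeline I read the paper as using (the feature names and model choices below are my guesses, not theirs):

```python
# Combine mean-pooled word2vec vectors of a patch's text with a simple
# reviewer-experience feature, then train a classifier that predicts
# whether the patch gets more than one review round.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier

patch_tokens = [["fix", "null", "check"], ["refactor", "parser", "loop"]]
reviewer_experience = np.array([[120], [3]])  # prior reviews (assumed feature)
rounds = np.array([1, 3])                     # observed review rounds

w2v = Word2Vec(patch_tokens, vector_size=32, min_count=1, seed=0)

def embed(tokens):
    # Mean-pool token vectors into one fixed-size patch vector.
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

X = np.hstack([np.vstack([embed(t) for t in patch_tokens]),
               reviewer_experience])
y = (rounds > 1).astype(int)  # binary target: multi-round or not

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict(X))
```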

Rationality by Steven Pinker

Rationality: What It Is, Why It Seems Scarce, Why It Matters : Pinker, Steven: Amazon.se: Böcker

I’ve read this book because I wanted to get some inspiration from the social sciences. What I ended up with is a book about Bayesian statistics and its consequences. To some extent, I’m happy to have read it, because I got a better view of how to use Bayesian statistics in practice. Yet, I am a bit disappointed. Not with the statistics, but with the book.

A while back I read the book by Judea Pearl about Bayesian networks and their role in statistics. That book was about something new and a bit fresh. It caught my interest, and I felt refreshed after reading it. This book was a bit too much of a repetition. Don’t get me wrong, I like repetition and I like the style of Steven Pinker (he is one of my role models when it comes to academic writing).

The book is about making inferences by calculating probabilities using Bayes’ theorem. It explains this very well and shows how such inferences are used and misused in modern society – from politics to medical diagnoses. The book also shows a few tricks for making better decisions and inferences.
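Here is the classic base-rate example worked out in a few lines of Python (the numbers are mine, chosen for illustration): how likely is a disease given a positive test?

```python
# Bayes' theorem: P(disease | positive) =
#   P(positive | disease) * P(disease) / P(positive)
prevalence = 0.01       # P(disease)
sensitivity = 0.90      # P(positive | disease)
false_positive = 0.05   # P(positive | no disease)

p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {posterior:.2%}")  # ~15.4%
```

The punchline, which the book hammers home, is that even a fairly accurate test leaves the posterior surprisingly low when the condition is rare.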

I guess I need to read it once more in order to get a better taste of its distinct flavor.