September 2022 – SE metrics (Software Engineering)

50 Language/Code models, let’s talk…

As you have probably observed, I’ve been into language models for code analysis, design and recognition. It’s a great way of spending your research time, as it lets you understand how we program and how to model that. In my personal case, this is a great complement to the empirical software engineering research that I do otherwise.

Recently I have had the feeling that I look into more and more of these models, all of them bearing a certain similarity to Google’s BERT model or Facebook’s TransCoder. So I set off to do a short review of the papers that actually talk about code models or, as they are often called, programming language models. I started from the paper describing CodeBERT ( [2002.08155] CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arxiv.org) ) and looked at the 500 citations that the model has. The list below is just the list of the models that are created based on CodeBERT. There are also models created based on AlphaCode or GitHub Copilot, but I leave these for another occasion.

I must admit that I did not read all of these papers and did not look at all of these models. Far from it, I only looked at some of them. My conclusion is that we have a lot of models, but the quality of the results varies a lot. The best models provide good results in ca. 20% of cases. AlphaCode is an example of such a model, which is fantastic, but not super-accurate all the time. As the model is used for super-competitive tasks, 20% is actually very impressive – it’s difficult to say that I would do better in these programming competitions, so I’m not criticizing.

The best model I’ve seen so far, however, is GitHub Copilot, which is by far the best code-generating model that the world has seen. Well, there may be models that the world has not seen, but then they do not count. If you would like to see a preview of how I use it (part I), you can take a look at this video:

I sincerely hope that you find this list useful and that you can help me to keep it updated – drop me an e-mail about the list if you want to:

AlphaCode: https://www.deepmind.com/blog/competitive-programming-with-alphacode
TransCoder: https://github.com/facebookresearch/TransCoder
CodeT5: https://arxiv.org/pdf/2109.00859
CodeITT5: https://arxiv.org/pdf/2208.05446
ProphetNet: https://arxiv.org/pdf/2104.08006
Cotex: https://arxiv.org/pdf/2105.08645
Commit2vec: https://arxiv.org/pdf/1911.07605
CoreGen: https://www.sciencedirect.com/science/article/pii/S092523122100792X
SyncoBERT: https://arxiv.org/pdf/2108.04556
TreeBERT: https://proceedings.mlr.press/v161/jiang21a/jiang21a.pdf
FastSpec: https://ieeexplore.ieee.org/iel7/9581154/9581061/09581258.pdf
CVEFixes: https://dl.acm.org/doi/pdf/10.1145/3475960.3475985
CodeNet: https://arxiv.org/pdf/2105.12655
Graph4Code: https://www.researchgate.net/profile/Jamie-Mccusker-2/publication/339445570_Graph4Code_A_Machine_Interpretable_Knowledge_Graph_for_Code/links/5fd2a29a45851568d154cfaa/Graph4Code-A-Machine-Interpretable-Knowledge-Graph-for-Code.pdf
DeGraphCE: https://dl.acm.org/doi/pdf/10.1145/3546066
VELVET: https://ieeexplore.ieee.org/iel7/9825713/9825693/09825786.pdf
Code2Vec: https://uwspace.uwaterloo.ca/bitstream/handle/10012/15862/Arumugam_Lakshmanan.pdf?sequence=9&isAllowed=y
MulCode: https://ieeexplore.ieee.org/iel7/9425868/9425874/09426045.pdf
Flakify: https://ieeexplore.ieee.org/iel7/32/4359463/09866550.pdf
CoDesc: https://arxiv.org/pdf/2105.14220
NatGen: https://arxiv.org/pdf/2206.07585
Coctail: https://arxiv.org/pdf/2106.05345
MergeBERT: https://arxiv.org/pdf/2109.00084
SPTCode: https://dl.acm.org/doi/pdf/10.1145/3510003.3510096
InCoder: https://arxiv.org/pdf/2204.05999
JavaBERT: https://ieeexplore.ieee.org/iel7/9680270/9679822/09680322.pdf
BERT2Code: https://arxiv.org/pdf/2104.08017
NeuralCC: https://arxiv.org/pdf/2012.03225
LineVD: https://arxiv.org/pdf/2203.05181
GraphCode2Vec: https://arxiv.org/pdf/2112.01218
ASTBERT: https://arxiv.org/pdf/2201.07984
CodeRL: https://arxiv.org/pdf/2207.01780
CV4Code: https://arxiv.org/pdf/2205.08585
NaturalCC: https://xcodemind.github.io/papers/icse22_naturalcc_camera_submitted.pdf
StructCode: https://arxiv.org/pdf/2206.05239
VulBERT: https://arxiv.org/pdf/2205.12424
CodeMVP: https://arxiv.org/pdf/2205.02029
miBERT: https://ieeexplore.ieee.org/iel7/9787917/9787918/09787973.pdf?casa_token=rPNbu-k9Gh4AAAAA:3lkZVyUjnDP4Sp1UmmO9eVftsRaf1zAuw1YhHQogsyDBE2Y7992gBlhPb9jKVcI-5Q8tTv2JEyQ
LineVUL: https://www.researchgate.net/profile/Chakkrit-Tantithamthavorn/publication/359402890_LineVul_A_Transformer-based_Line-Level_Vulnerability_Prediction/links/623ee3d48068956f3c4cbede/LineVul-A-Transformer-based-Line-Level-Vulnerability-Prediction.pdf
CommitBART: https://arxiv.org/pdf/2208.08100
GAPGen: https://arxiv.org/pdf/2201.08810
El-CodeBERT: https://dl.acm.org/doi/pdf/10.1145/3545258.3545260?casa_token=DNyXQpkP69MAAAAA:y2iJC3RliEh7yJ6SzRpRRKrzPn2Q6w25vpm5vpoN0TksDh_SbmVfa_8JcDxvVN8FydOL_vTJqH-6OA
COCLUBERT: https://ieeexplore.ieee.org/iel7/9679834/9679948/09680081.pdf?casa_token=FtrqlHTmm74AAAAA:kkMyRsMl9xqPQQSBTRd6vFD-2-DyVSomYBYqm8u8aKs7B0_rkYYfL_OLVmOHgzn1-vqMF6W7pM8
Xcode: https://dl.acm.org/doi/pdf/10.1145/3506696?casa_token=5H8iW3e2GlYAAAAA:m2QA-DXSk5LZYazFxDPEVfLZcYREqDomXNg5YmkR-rPllHD37Qd8eLw_SCu6rbhNHZJ2Od24dvJt_Q
CobolBERT: https://arxiv.org/pdf/2201.09448
SiamBERT: https://melqkiades.github.io/files/download/papers/siambert-sais-2022.pdf
CodeReviewer: https://arxiv.org/pdf/2203.09095
CodeBERT-nt: https://arxiv.org/pdf/2208.06042
BashExplainer: https://arxiv.org/pdf/2206.13325

So, you want to automate your security assessment (beyond pentesting)…

Automatic Security Assessment of GitHub Actions Workflows (arxiv.org)

After my last post, and the visit to the workshop at MDU, I realized that there are already a few tools that can be used automatically. So, this paper presents one of them.

What is interesting about this tool is that it works on GitHub workflows, so it’s compatible with many modern CI/CD pipelines. The tool analyzes workflows and looks for security vulnerabilities. For example, it checks whether you keep sensitive information (secrets) in a plain text file that is used in the workflow, or whether the workflow enforces the “least privilege” principle.
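To make these two kinds of checks a bit more concrete, here is a minimal sketch in Python. It is not how GHAST works internally (I have not looked at its implementation); the function name `check_workflow`, the keyword list and the regular expression are my own assumptions, and the sketch only looks at the text of a single workflow file.

```python
import re

# Assumption: keys with these names often hold credentials. The list is
# illustrative only and is not taken from GHAST.
SECRET_PATTERN = re.compile(
    r"(?i)\b(password|passwd|token|api[_-]?key|secret)\b\s*:\s*(\S.*)$"
)


def check_workflow(text: str) -> list[str]:
    """Return a list of simple security findings for a workflow file."""
    findings = []
    lines = text.splitlines()

    # Least privilege: a workflow without any 'permissions:' key relies on
    # the default token permissions instead of declaring what it needs.
    if not any(re.match(r"^\s*permissions\s*:", line) for line in lines):
        findings.append(
            "no explicit 'permissions' block (least privilege not enforced)"
        )

    # Plain-text secrets: a credential-like key whose value is a literal
    # instead of a '${{ secrets.NAME }}' expression.
    for number, line in enumerate(lines, start=1):
        match = SECRET_PATTERN.search(line)
        if match and "${{" not in match.group(2):
            findings.append(
                f"line {number}: possible plain-text secret '{match.group(1)}'"
            )
    return findings


# A made-up workflow used only to exercise the checks above.
WORKFLOW = """\
name: build
on: push
jobs:
  test:
    runs-on: ubuntu-latest
    env:
      API_KEY: abc123
      TOKEN: ${{ secrets.TOKEN }}
    steps:
      - uses: actions/checkout@v3
"""

if __name__ == "__main__":
    for finding in check_workflow(WORKFLOW):
        print(finding)
```

A line-based regular expression is a deliberately crude choice here. A real tool would parse the YAML and reason about jobs, steps and third-party actions, but even this sketch shows that both checks can be run automatically on every commit.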

The implementation of the tool is OSS and can be found on GitHub here: Mobile-IoT-Security-Lab/GHAST: GitHub Actions Security Tester

I need to test it as it looks very interesting. Maybe I can use this tool on some of the company’s workflows to test their exploitability score?

Code reviews and cybersecurity… (article highlight)

https://arxiv.org/pdf/2208.04261.pdf

So I find myself on the train again, this time strolling towards MDU for their cybersecurity workshop. Not that I am an expert on cybersecurity as such, but I know a bit about programming and design. I also know enough to see that a secure product needs to be designed for security from the start, not only tested for it.

I stumbled upon this paper about a week ago, probably as it has been submitted to some conference and the pre-print became available. It is a paper that interviews 10 developers and surveys over 180 professionals about how they work with finding security vulnerabilities during code reviews. I will not describe the entire article, although I wish I had the time to do that. Here are some of the highlights.

“Interviewees stated to disregard security aspects during code reviews due to their assumptions about the security dynamic of the application they develop.” – this is an interesting finding, as many companies see code reviews as a silver bullet of software quality assurance today. Yet, the developers do not review something they think “someone else” does…

When it comes to the survey, the results show that the majority of software developers think about security during their code reviews. The majority of the developers admit that there are no security experts reviewing their code, which is probably not great. Maybe we should have some of the security experts do some code reviews? Maybe both the developers and the security specialists would learn something from one another?

Finally, I think that the survey puts a finger on one of the pain points in modern companies – support for specific aspects of code reviews. The respondents would like to see more support for developers in making better security evaluations. I could only speculate that this is about in-depth training.

Well, very interesting reading. Let me get back to the paper, looking at the beautiful landscapes of Östergötland….

What are code reviews really good for?

Visualization of the source code of one module from the Cloudera projects. The embeddings are taken from our team’s neural network. t-SNE is a visualization technique widely used in bioinformatics.

Concerns identified in code review: A fine-grained, faceted classification – ScienceDirect

Code reviews are time consuming. And effort intensive. And boring. And needed. Depending on whom we ask, we get one of the above answers (well, 80% of the time). The reality is that code reviews are not the most productive activity. Reading the code and looking for defects is good when we do it once, but when we need to do it all the time during continuous integration, the story changes. It becomes like studying for an exam or doing homework – we do everything else to postpone it. Then someone waits longer or the code quality suffers.

There has been a lot of work done to make this activity more fun – gamification, automated support, using machine learning to filter out the code that we can automatically check – just to name a few. As far as I know, there has not been much work on understanding what kinds of problems code reviews really find.

In this article, the authors address that very question. Admittedly, they only analyzed 7 OSS projects, but their results are still interesting: “We identified 116 defect types that we grouped into 15 groups to create a defect classification. Additionally, 38% of these defects could be automatically detected accurately.”

So, that basically means that 38% of defects could be identified by using testing or static analysis (or some other fancy automation technique). This figure summarizes their results (this is a link to the figure in sciencedirect): https://ars.els-cdn.com/content/image/1-s2.0-S0950584922001653-gr5_lrg.jpg

So, what are code reviews good for? Here is their list:

threads,
header comments,
errors, warnings and logging,
test cases,
annotations,
performance,
identifier naming,
modifiers,
comments,
javadoc,
design,
implementation, and
logic and functionality

The list is sorted from the least frequent to the most frequent – so logic and functionality is the category where code reviews are the most useful. I also need to say that the frequencies are not super-high – threading has only 1 detected concern, while logic and functionality has 57. So, you know, it could be more, given how much time is spent on code reviews. I guess this is what quality costs nowadays, even though there is no real data on this.