50 Language/Code models, let’s talk…

BIld av pencil parker från Pixabay

As you have probably observed I’ve been into language models for code analysis, design and recognition. It’s a great way of spending your research time as it gives you the possibility to understand how we program and understand how to model that. In my personal case, this is a great complement to the empirical software engineering research that I do otherwise.

In the recent time I got a feeling that I look into more and more of these models, all of them baring certain similarity to the Google’s BERT model or the Fracebook’s TransCoder. So I set off to do a short review of the papers that actually talk about code models or, as they are often called, programming language models. I started from the paper describing CodeBERT ( [2002.08155] CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arxiv.org) ) and looked at the 500 citations that the model has. The list below is just the list of the models that are created based on CodeBERT. There are also models created based on AlphaGo or Github CoPilot, but I leave these for another occasion.

I must admit that I did not read all of these papers and did not look at all of these models. Far from it, I only looked at some of them. My conclusion is that we have a lot of models, but the quality of the results vary a lot. The best models provide good results in ca. 20% of cases. AlphaCode is an example of such a model, which is fantastic, but not super-accurate all the time. As the model is used for super-competitive tasks, 20% is actually very impressive – it’s difficult to say that I would do better for these programming competitions, so I’m not criticizing.

The best model I’ve seen so far, however, is the Github CoPilot, which is by far the best model to create code that the world has seen. Well, there may be models that the world has not seen, but then they do not count. If you would like to see a preview of how I use it (part I), you can take a look at this video:

I sincerely hope that you find this list useful and that you can help me to keep it updated – drop me an e-mail about the list if you want to:

  1. AlphaGo: https://www.deepmind.com/blog/competitive-programming-with-alphacode
  2. TransCoder: https://github.com/facebookresearch/TransCoder
  3. CodeT5: https://arxiv.org/pdf/2109.00859 
  4. CodeITT5: https://arxiv.org/pdf/2208.05446 
  5. ProphetNet: https://arxiv.org/pdf/2104.08006 
  6. Cotex: https://arxiv.org/pdf/2105.08645 
  7. Commit2vec: https://arxiv.org/pdf/1911.07605 
  8. CoreGen: https://www.sciencedirect.com/science/article/pii/S092523122100792X  
  9. SyncoBERT: https://arxiv.org/pdf/2108.04556 
  10. TreeBERT: https://proceedings.mlr.press/v161/jiang21a/jiang21a.pdf 
  11. FastSpec: https://ieeexplore.ieee.org/iel7/9581154/9581061/09581258.pdf 
  12. CVEFixes: https://dl.acm.org/doi/pdf/10.1145/3475960.3475985 
  13. CodeNet: https://arxiv.org/pdf/2105.12655
  14. Graph4Code: https://www.researchgate.net/profile/Jamie-Mccusker-2/publication/339445570_Graph4Code_A_Machine_Interpretable_Knowledge_Graph_for_Code/links/5fd2a29a45851568d154cfaa/Graph4Code-A-Machine-Interpretable-Knowledge-Graph-for-Code.pdf 
  15. DeGraphCE: https://dl.acm.org/doi/pdf/10.1145/3546066 
  16. VELVET: https://ieeexplore.ieee.org/iel7/9825713/9825693/09825786.pdf
  17. Code2Vec: https://uwspace.uwaterloo.ca/bitstream/handle/10012/15862/Arumugam_Lakshmanan.pdf?sequence=9&isAllowed=y 
  18. MulCode: https://ieeexplore.ieee.org/iel7/9425868/9425874/09426045.pdf 
  19. Flakify: https://ieeexplore.ieee.org/iel7/32/4359463/09866550.pdf 
  20. CoDesc: https://arxiv.org/pdf/2105.14220 
  21. NatGen: https://arxiv.org/pdf/2206.07585 
  22. Coctail: https://arxiv.org/pdf/2106.05345 
  23. MergeBERT: https://arxiv.org/pdf/2109.00084 
  24. SPTCode: https://dl.acm.org/doi/pdf/10.1145/3510003.3510096 
  25. InCoder: https://arxiv.org/pdf/2204.05999 
  26. JavaBERT: https://ieeexplore.ieee.org/iel7/9680270/9679822/09680322.pdf 
  27. BERT2Code: https://arxiv.org/pdf/2104.08017 
  28. NeuralCC: https://arxiv.org/pdf/2012.03225 
  29. LineVD: https://arxiv.org/pdf/2203.05181 
  30. GraphCode2Vec: https://arxiv.org/pdf/2112.01218 
  31. ASTBERT: https://arxiv.org/pdf/2201.07984 
  32. CodeRL: https://arxiv.org/pdf/2207.01780 
  33. CV4Code: https://arxiv.org/pdf/2205.08585 
  34. NaturalCC: https://xcodemind.github.io/papers/icse22_naturalcc_camera_submitted.pdf 
  35. StructCode: https://arxiv.org/pdf/2206.05239   
  36. VulBERT: https://arxiv.org/pdf/2205.12424 
  37. CodeMVP: https://arxiv.org/pdf/2205.02029 
  38. miBERT: https://ieeexplore.ieee.org/iel7/9787917/9787918/09787973.pdf?casa_token=rPNbu-k9Gh4AAAAA:3lkZVyUjnDP4Sp1UmmO9eVftsRaf1zAuw1YhHQogsyDBE2Y7992gBlhPb9jKVcI-5Q8tTv2JEyQ 
  39. LineVUL: https://www.researchgate.net/profile/Chakkrit-Tantithamthavorn/publication/359402890_LineVul_A_Transformer-based_Line-Level_Vulnerability_Prediction/links/623ee3d48068956f3c4cbede/LineVul-A-Transformer-based-Line-Level-Vulnerability-Prediction.pdf 
  40. CommitBART: https://arxiv.org/pdf/2208.08100 
  41. GAPGen: https://arxiv.org/pdf/2201.08810 
  42. El-CodeBERT: https://dl.acm.org/doi/pdf/10.1145/3545258.3545260?casa_token=DNyXQpkP69MAAAAA:y2iJC3RliEh7yJ6SzRpRRKrzPn2Q6w25vpm5vpoN0TksDh_SbmVfa_8JcDxvVN8FydOL_vTJqH-6OA 
  43. COCLUBERT: https://ieeexplore.ieee.org/iel7/9679834/9679948/09680081.pdf?casa_token=FtrqlHTmm74AAAAA:kkMyRsMl9xqPQQSBTRd6vFD-2-DyVSomYBYqm8u8aKs7B0_rkYYfL_OLVmOHgzn1-vqMF6W7pM8 
  44. Xcode: https://dl.acm.org/doi/pdf/10.1145/3506696?casa_token=5H8iW3e2GlYAAAAA:m2QA-DXSk5LZYazFxDPEVfLZcYREqDomXNg5YmkR-rPllHD37Qd8eLw_SCu6rbhNHZJ2Od24dvJt_Q 
  45. CobolBERT: https://arxiv.org/pdf/2201.09448 
  46. SiamBERT: https://melqkiades.github.io/files/download/papers/siambert-sais-2022.pdf 
  47. CodeReviewer: https://arxiv.org/pdf/2203.09095 
  48. CodeBERT-nt: https://arxiv.org/pdf/2208.06042 
  49. BashExplainer: https://arxiv.org/pdf/2206.13325

Machine learning in compilers???

BenchPress: A Deep Active Benchmark Generator (arxiv.org)

To be honest, I did not expect machine learning to be part of a compiler… I’ve done programming since I was 13, understood compilers during my second year at the university and even wrote one (well, without any ML, that is).

Why would a compiler need machine learning, I wondered. It’s a pretty simple program – it takes a grammar, then parses the source code and translates that to a machine code (or some other low level representation). It has to be deterministic as the same program cannot compile to two different machine codes. It’s just the way it is….

It turns out that machine learning is used in modern compilers to perform optimizations. The optimizations are done to take advantage of modern processors, their registers and long instructions sets. These optimizations are meant to support machine code in being more parallel, allowing the modern multi-core, multi-thread processors to utilize every little bit of energy in all their cores.

In this paper, the authors use language models like BERT to create a benchmark that will allow different optimization techniques to be compared. This means, that the same compiler, can test itself against these benchmarks in order to find the best possible solution. Clever!

However, this is it from me. I’m planning on writing a compiler, let alone an optimizer. I may use BERT models in the future for generation of programs, but I will most probably end there. But, in case you wonder – there is ML in compilers 🙂

Language models in Software Engineering (new paper review)

Image by Lorenzo Cafaro from Pixabay

Articla available at: https://arxiv.org/pdf/2205.11739.pdf

It’s no secret that I’ve been fascinated by modern, BERT-like language models. I’ve seen what they can do and how they operate, use them in two of my research projects. So, when this paper came around, I read it directly.

It’s a paper which makes an overview of what kind of tasks the language models are used in software engineering today. The list is long and contains a variety of tasks, e.g., code-to-code retrieval, repairing of source code or bug finding/fixing. In total a lot of these tasks, but, IMHO, a bit low-level tasks. There are no tasks that attempt to understand code at the design-level, for example whether we can really see specific design in the code.

The paper also shows which models are used, and provides references to these models. They list 20 models, with the tasks for which they were trained, including the datasets that they were trained on. Fantastic!

I need to dive deeper into these models, but I’m super happy about the fact that there is a list of these models now and that the language technology makes a significant body of work in software engineering now.

Automating the Measurement of Heterogeneous Chatbot Designs (paper review)

Image by NPXL_Studio from Pixabay

Paper from: http://miso.es/pubs/ACMSAC_2022.pdf

Using chatbots has gained importance in recent years, which has resulted in development of several chatbot platforms (like Amazon Lex, Google DialogFlow or IBM Watson). However, there is a limited number of studies related to quality assurance of chatbots. The paper from Pablo C. Cañizares, Sara Pérez-Soler, Esther Guerra and Juan de Lara addresses just this problem – how to guide testing of chatbots by using design metrics.

The paper proposes six global metrics (e.g., number of intents of the bot), eight intent metrics (e.g., number of training phrases per intent), three entity metrics (e.g., word length), and three flow metrics (e.g., conversation length). By measuring the values for these metrics, software designers of chatbots can predict three usability types – effectiveness, efficiency and satisfaction. To support the measurement process, the paper proposes a tool, available on GitHub, which can be used by practitioners. For some of the metrics, the tool employs machine learning and natural language processing. The metrics and the tool are evaluated on twelve chatbot designs. The tool could identify quality issues in terms of readability, conversation complexity, user experience and bot understanding. This demonstrates the usefulness of the tool in practice and how these metrics can help software developers in designing high-quality bots.

The metrics from the paper are:

  • INT – # intents
  • ENT – # user-defined entities
  • FLOW – # conversation entry points
  • PATH – # different conversation flow paths
  • CNF – # confusing phrases
  • SNT – # positive, neutral, negative output phrases
  • TPI – # training phrases per intent
  • WPTP – # words per training phrase
  • VPTP – # verbs per training phrase
  • PPTP – # parameters per training phrase
  • WPOP – # words per output phrase
  • VPOP – # verbs per output phrase
  • CPOP – # characters per output phrase
  • READ – reading time of the output phrases
  • LPE – # literals per entity
  • SPL – # synonyms per literal
  • WL – word length
  • FACT – # actions per flow
  • FPATH – # conversation flow paths
  • CL – conversation length

I will try to use these metrics if I write chatbot 🙂

Testing of ML systems

BIld av OpenClipart-Vectors från Pixabay

Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

Machine learning systems are very popular today, at least when it comes to research applications. They are not as popular as one would wished (or liked) in the real applications. One of the reasons is the fact that they are hard to test. We do not know how to check if an algorithm will behave as expected in all similar situations – well, we do not know which situations are similar for us and for the ML system.

This paper looks at the problem from a different angle. The research question is: RQ: What are simple and generic software tests that are capable of finding bugs and improving the quality of machine learning algorithms?

The authors developed a set of smoke tests, which they see that all ML algorithms should pass. The paper is quite exhaustive and if you are interested, I recommend to take a look at this table:

Table 1 | Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

I love the article. It is simple, to the point and very applied. I’m going to use that in my tests of ML algorithms in the future.

How good are language models for source code tasks?


Using machine learning, and deep learning in particular, for software engineering tasks exploded recently. I would say that it exploded a bit too much. I’m myself to blame here as our team was one of the early adopters with the CCFlex model and source code analysis.

Well, this paper compares a number of modern deep learning models, so called transformers, in various code and comment analysis tasks. The authors did a great job in collecting a set of models and datasets, trained them and critically evaluated the performance.

I recommend reading the entire paper, but what they found was a bit surprised for me. First of all, they found that the transformer models are better for the natural language and not so great for the source code analysis. The hypothesis is that the structure of programs is important here. They have also found that pre-training is important, but not crucial. Pre-training attributes to a moderate effect in the end. The dataset, and its content, is much more important for the task at hand.

This is a great paper and I hope that this can become an essential reading for software engineers working with AI systems engineering supporting the software engineering tasks.

autoML – let’s talk about it…

Image from Pixabay

AutoML, a promise of green pastures, less work, optimal results. So, it is like that? In this post I share my view on this and experience from running the first test using that model.

First of all, let’s be honest, there is not such thing as a free lunch. In case of autoML (auto-sklearn), the price tag comes first with the effort, skills and time to install it and make it work. The second is the performance…. It’s painfully slow compared to your own models, simply because it tests a lot of models here and there. It also take a lot of time to download and to make it work.

But, first thing first, let me tell you where I start. So, I used the data from the MicroHRV project ( 3. MicroHRV: Recognizing Rare Events in Microwave Radio Links and Intensive Care Units using Machine Learning – Software Center (software-center.se)). The data is from patients being operated to remove clots of blood from the brain (although dangerous it may sound, the actual procedure is planned and calm). I wanted to check whether autoML can do better compared to what we have at the moment.

What we have at the moment (for that particular dataset) is: Accuracy: 0.98, Precision: 0.98, Recall: 0.98 – using Random Forest classifier. So, this is actually already very good. For the medical domain, that’s actually in class of its own, given our previous studies ended up with ca. 0.7 in accuracy at best.

When it comes to installing autoML – if you like stackoverflow, downgrading, upgrading, compiling, etc. and run Windows 10, then it’s your heaven. If you run Linux – no problems. Otherwise – stick to manual analyses:)

After two days (and nights) of trying, the best configuration was:

  • WSL – Windows Subsystem for Linux
  • Ubuntu 20, and
  • countless of oss libraries

It takes a while to get it to work, the question is whether the results are good enough…

After three hours of waiting, a lot of heat from my laptop, over 1,000 models tested resulted in Accuracy: 0.91, Precision: 0.94, Recall: 0.91

So, worse than my manual selection of models. I include the confusion matrices.

Random forest

The matrices are not that different, as the validation sets are not that large either. However, it seems that the RF is still better than the best model from autoML.

I need work more on that and see if I do something wrong. However, I take this as a success – I’m better than autoML (still some use of an old professor) – instead of a let-down of not getting better results.

By the end of the day, 0.98 in accuracy is still very good!

Reproducing AI models – a guideline

Image by Pete Linforth from Pixabay

2107.00821.pdf (arxiv.org)

Machine learning has been used in software engineering as a great tool for both research and development. The fact that we have access to TensorFlow, PyCharm, and other toolkits, provides almost endless possibilities. Combine that with the hundreds (if not thousands) of datasets from Zenodo and Co. and you can train a model for almost anything.

So far, so good, I would say. Problems (yes, there are always some problems) appear when we want to reproduce the results of others. Training a model on your own dataset and making it available is easy. Trusting such a model in a new context is not.

Imagine an example of an ML model trained on data from Company X. We have probably tuned the parameters a lot, so the model works great there, but does it work for Company Y? Most probably it will not. Well, it will work, but the performance of the predictions are not going to be great.

So, Google has partner up with academic partners to set up SIGMODELS, and TensorFlow garden, initiatives that are aimed at making ML models more portable, experiments more replicable, and all the other goodies.

In this paper, the authors provide a set of checks, which we can use to make the models more transparent, which is the first step towards reproducibility. In these guidelines, the authors advocate for reporting the models architecture, their input and output structure, building blocks, loss functions, etc.

Naturally, they also recommend to report metrics which were used to optimize the models, e.g. accuracy, F1-score, MCC or others. I know, these are probably essentials, but you would be surprised to see that many authors do not really report these metrics. If they are omitted, then how do we know if the metrics were just so poor that the authors omitted them (low performance of the model) or that they are not relevant (low relevance of the metrics – which is a good thing).

For now, these guidelines are only a draft, but I hope that they will become more mainstream. just like the emprical guidelines from ACM (GitHub – acmsigsoft/EmpiricalStandards: Empirical standards for conducting and evaluating research in software engineering).

What will the future bring ….


As my summer goes on, I’ve decided to take a look at the book of one of my favorite authors and scientists – Prof. Stephen Hawking. I’ve loved his books when I was younger and I like the way he could bring difficult theories to the masses, like his famous “gray holes” as opposed to the black ones.

This book allowed me to reflect on some of the most common questions that people ask, even to me. Like whether AI will take over or whether we should even invest in AI. Since I’m not a physicist, I cannot answer most of the questions, but I think the AI question is something that I can even attempt.

So, will AI take over? Is GPT-3 something to worry about? Will we be out of work as programmers? Well, not really. I think that we live in a world that is very diverse and that we need human judgement to make sure that we can live on. Take the recent cyber attack on Kaseya, which is a US-based company with thousands of clients. The attack affected a minority of their clients, some 40 or so (if I remember the article correctly). However, it make the entire COOP chain in Sweden stranded. Food was given away for free as there was no way to take payments. Other grocery shops bought the grocery stock from the affected chain, to make sure people have enough food. So, what would AI do?

Let’s think statistically for a moment. 40 customers, out of ca. 40,000, is about 0.1 percent. So, ignoring this event would just make 99.9% accuracy for AI. Is this good? Statistically, this is great! Almost perfect. For AI, therefore, this would be like a great way of optimizing. Not paying ransom, make sure that 0.1% is taken as a negligible error somewhere.

Now, let’s think about the social value. Without knowing the rest of the customers affected, or even the ones that were not affected, I could say that the value of having food on your table trumps many other kinds of value. Well, maybe not your health, but definitely something like a car or a computer game. So, the societal impact of that is large. We could model that in the AI, but there is programmatic problem. How to calculate value of diverse things – a car or food. There is the monetary value, of course, but it’s not constant in time. For someone who is hungry for days, the value of a sandwich is infinitely larger than the value of the same sandwich for someone who has just eaten a delicious stake. Another problem is that the value is dependent on the location (is there another grocery shop close by?), your stock (which is individual and hard to find for AI), or even the ability to use another system of payment (can I just get my groceries and pay later?)

This example shows an inherent problem in finding the right data to use for AI. I believe that this is a problem that will not really be solved. And if it cannot be solved, I think I would like to pay a few cents extra to have a human in the loop. I would liked to know that there is an option, in the event if a hacker attack, to talk to a person who understands my needs and can help me. Give me food without paying, knowing where I live and that I will pay later.

Until there are models which understand us, humans, we need to be able to stick to having humans in the loop. Given that there are ca. 6 billion people in the world, potentially different, with conflicting needs, I do not things AI will be able to help us in critical parts of the society.

New from ML?

I seldom write about films and events, well maybe actually never, but this year, a lot has happened in the online way.

What’s new in Machine Learning | Keynote – YouTube

The video above is about the news from Google about their TensorFlow library, which include new ways of training models, compression and performance tuning and more.

TensorFlow Light and TensorFlow JS allow us to use the same models as for desktops, but on mobile devices. Really impressive. I’ve caught myself thinking whether I’m more impressed by the hardware capabilities of small devices, or the capabilities of software. Either way – super cool.

Google is not the only company announcing something. NVidia is also showing a lot of cool features for enterprises. Cloud access for rapid prototyping, model testing and deployments are in the center of that.

NVIDIA Executive Keynote for Enterprise AI at COMPUTEX 2021 – YouTube

I like gaming, so this is impressing, but even more impressive is to look at the last-year’s DLSS technology, which still cannot be beaten by the competition. Really nice.