machine learning – Page 2 – SE metrics (Software Engineering)

How can AI see programming code… (article highlight)

A systematic mapping study of source code representation for deep learning in software engineering – Samoaa – 2022 – IET Software – Wiley Online Library

Understanding programming languages is an important research topic in the area of programming language models. I’ve written before that there are ca. 50 programming language models that we can use in software engineering. OK, not all of them are equivalent and each is specific to its task, but they are available, so we can use and customize them.

Now, whether 50 models is a lot or not is debatable. Compared to natural language models this is a small number. Even compared to the number of programming languages this number is not impressive. However, how many languages are used widely – 10-15? Java, C, C++, Python, JavaScript, Rust, Go, and derivatives are the most common ones.

This article is a study done by our colleagues from the department. It’s too long to quote in detail, but there are a few things that I like. First, it’s a good overview of the types of language models:

Token-based representation: when the program code is basically a set of tokens/words; some can have a special meaning, but they are just words (I’ve written about this before, even worked with it: GitHub – mochodek/py-ccflex: py-ccflex – Python Flexible Code Classifier )
Tree-based representation: when the program code is seen from the perspective of its abstract syntax tree (AST); an example is the code2vec model: code2vec
Graph-based models: when the program code is seen as a directed graph, e.g., a control flow graph
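To make the three views a bit more concrete, here is a minimal sketch that uses only Python's standard library. The function names (token_view, tree_view, graph_view) are my own illustration and do not come from the paper. Note also that the graph view below simply lists the parent-to-child edges of the AST as a directed graph; it is a stand-in and not a real control flow graph.

```python
import ast
import io
import tokenize

SOURCE = "def add(a, b):\n    return a + b\n"

# Layout-only tokens that carry no "word" of their own.
LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
          tokenize.DEDENT, tokenize.ENDMARKER}


def token_view(source):
    """Token-based view: the code as a flat sequence of words."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in LAYOUT]


def tree_view(source):
    """Tree-based view: the node types of the abstract syntax tree."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def graph_view(source):
    """Graph-based view (simplified): directed parent -> child AST edges."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


print(token_view(SOURCE))
print(tree_view(SOURCE))
print(graph_view(SOURCE))
```

The same two lines of code give three quite different inputs for a model: a list of words, a list of typed nodes, and a list of edges. Which one works best depends on the task.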

Although I like this classification, I see that it misses one of the most prominent and most popular types – the NLP-based model. It is a type of model where the program code is seen as a set of sentences that have meaning of some sort. It is a derivative of the token-based representation, but it is much more than that. Codex from OpenAI is an example of such a model.

Nevertheless, this study provides a very interesting set of examples of models and their applications. I sincerely suggest taking a look at this paper to understand how the models work and where they are used best.

CoditT5: Pretraining for Source Code and Natural Language Editing

CoditT5: Pretraining for Source Code and Natural Language Editing (pengyunie.github.io)

I’ve written about programming language models before, and it is no secret that I am very much into this topic. I like the way in which software engineering evolves – we become a more mature discipline and our tools become smarter by the hour (at least that’s how it feels).

This paper presents a new language model that is capable of doing code edits, i.e., such things as bug fixes. The model is essentially a transformer with an architecture that has been published before. However, the strength of this model is in the way in which it is trained. It uses so-called edit plans to train the model to change the input code, rather than to complement it.

The difference may not sound like much, but it is significant. The existing models are trained to complete code sequences and therefore they are very good at generating code. However, when given code that does not require any generation, they tend to copy the input sequence to the output sequence. Well, that is not very useful.
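To illustrate the difference between completing and editing, here is a small sketch that derives a list of edit operations from a buggy and a fixed token sequence using Python's difflib. This is only my illustration of the idea of describing a change as explicit operations; it is not the edit plan format used by CoditT5, and the function name edit_plan and the example tokens are made up.

```python
import difflib


def edit_plan(old_tokens, new_tokens):
    """Return the operations that turn old_tokens into new_tokens.

    Unchanged spans are skipped, so identical input and output
    give an empty plan instead of a copy of the input.
    """
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    plan = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        plan.append((tag, old_tokens[i1:i2], new_tokens[j1:j2]))
    return plan


buggy = "if x = None : return 0".split()
fixed = "if x is None : return 0".split()

print(edit_plan(buggy, fixed))  # [('replace', ['='], ['is'])]
print(edit_plan(fixed, fixed))  # []
```

A model that learns to produce such operations has to decide what to change, while a completion model only has to decide what comes next.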

Thanks to this new way of training, the model is able to edit code, remove defects, address review comments and so on. Yes, address review comments, this is not a joke. I sincerely believe that we can use this in practice in our tools one day.

At the moment, you can find the code for this model here: GitHub – EngineeringSoftware/CoditT5: Code and data for “CoditT5: Pretraining for Source Code and Natural Language Editing” in ASE 2022.

Language models and security vulnerabilities – what works and what does not…. (article review)

1176898.pdf (hindawi.com)

Language models are powerful tools if you know how to use them. One of the areas where they can be used is recognizing security vulnerabilities. In this article, the authors look into six language models and test them.

The results show that there are more challenges than solutions in this area. The models can be applied to programming languages, but the problem is with the examples and the ground truth. What is good about the paper is that it provides a good overview of the models and how they are used. The authors also look a bit deeper into why the limitations of the models occur.

It’s something that our team has also observed in other contexts, but I will talk about that on some other occasion. Stay tuned.

50 Language/Code models, let’s talk…

As you have probably observed I’ve been into language models for code analysis, design and recognition. It’s a great way of spending your research time as it gives you the possibility to understand how we program and understand how to model that. In my personal case, this is a great complement to the empirical software engineering research that I do otherwise.

Recently I got the feeling that I look into more and more of these models, all of them bearing a certain similarity to Google’s BERT model or Facebook’s TransCoder. So I set off to do a short review of the papers that actually talk about code models or, as they are often called, programming language models. I started from the paper describing CodeBERT ( [2002.08155] CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arxiv.org) ) and looked at the 500 citations that the model has. The list below is just the list of the models that are created based on CodeBERT. There are also models created based on AlphaCode or GitHub Copilot, but I leave these for another occasion.

I must admit that I did not read all of these papers and did not look at all of these models. Far from it, I only looked at some of them. My conclusion is that we have a lot of models, but the quality of the results varies a lot. The best models provide good results in ca. 20% of cases. AlphaCode is an example of such a model, which is fantastic, but not super-accurate all the time. As the model is used for super-competitive tasks, 20% is actually very impressive – it’s difficult to say that I would do better in these programming competitions, so I’m not criticizing.

The best model I’ve seen so far, however, is GitHub Copilot, which is by far the best model for creating code that the world has seen. Well, there may be models that the world has not seen, but then they do not count. If you would like to see a preview of how I use it (part I), you can take a look at this video:

I sincerely hope that you find this list useful and that you can help me to keep it updated – drop me an e-mail about the list if you want to:

AlphaCode: https://www.deepmind.com/blog/competitive-programming-with-alphacode
TransCoder: https://github.com/facebookresearch/TransCoder
CodeT5: https://arxiv.org/pdf/2109.00859
CoditT5: https://arxiv.org/pdf/2208.05446
ProphetNet: https://arxiv.org/pdf/2104.08006
Cotex: https://arxiv.org/pdf/2105.08645
Commit2vec: https://arxiv.org/pdf/1911.07605
CoreGen: https://www.sciencedirect.com/science/article/pii/S092523122100792X
SyncoBERT: https://arxiv.org/pdf/2108.04556
TreeBERT: https://proceedings.mlr.press/v161/jiang21a/jiang21a.pdf
FastSpec: https://ieeexplore.ieee.org/iel7/9581154/9581061/09581258.pdf
CVEFixes: https://dl.acm.org/doi/pdf/10.1145/3475960.3475985
CodeNet: https://arxiv.org/pdf/2105.12655
Graph4Code: https://www.researchgate.net/profile/Jamie-Mccusker-2/publication/339445570_Graph4Code_A_Machine_Interpretable_Knowledge_Graph_for_Code/links/5fd2a29a45851568d154cfaa/Graph4Code-A-Machine-Interpretable-Knowledge-Graph-for-Code.pdf
DeGraphCE: https://dl.acm.org/doi/pdf/10.1145/3546066
VELVET: https://ieeexplore.ieee.org/iel7/9825713/9825693/09825786.pdf
Code2Vec: https://uwspace.uwaterloo.ca/bitstream/handle/10012/15862/Arumugam_Lakshmanan.pdf?sequence=9&isAllowed=y
MulCode: https://ieeexplore.ieee.org/iel7/9425868/9425874/09426045.pdf
Flakify: https://ieeexplore.ieee.org/iel7/32/4359463/09866550.pdf
CoDesc: https://arxiv.org/pdf/2105.14220
NatGen: https://arxiv.org/pdf/2206.07585
Coctail: https://arxiv.org/pdf/2106.05345
MergeBERT: https://arxiv.org/pdf/2109.00084
SPTCode: https://dl.acm.org/doi/pdf/10.1145/3510003.3510096
InCoder: https://arxiv.org/pdf/2204.05999
JavaBERT: https://ieeexplore.ieee.org/iel7/9680270/9679822/09680322.pdf
BERT2Code: https://arxiv.org/pdf/2104.08017
NeuralCC: https://arxiv.org/pdf/2012.03225
LineVD: https://arxiv.org/pdf/2203.05181
GraphCode2Vec: https://arxiv.org/pdf/2112.01218
ASTBERT: https://arxiv.org/pdf/2201.07984
CodeRL: https://arxiv.org/pdf/2207.01780
CV4Code: https://arxiv.org/pdf/2205.08585
NaturalCC: https://xcodemind.github.io/papers/icse22_naturalcc_camera_submitted.pdf
StructCode: https://arxiv.org/pdf/2206.05239
VulBERT: https://arxiv.org/pdf/2205.12424
CodeMVP: https://arxiv.org/pdf/2205.02029
miBERT: https://ieeexplore.ieee.org/iel7/9787917/9787918/09787973.pdf?casa_token=rPNbu-k9Gh4AAAAA:3lkZVyUjnDP4Sp1UmmO9eVftsRaf1zAuw1YhHQogsyDBE2Y7992gBlhPb9jKVcI-5Q8tTv2JEyQ
LineVUL: https://www.researchgate.net/profile/Chakkrit-Tantithamthavorn/publication/359402890_LineVul_A_Transformer-based_Line-Level_Vulnerability_Prediction/links/623ee3d48068956f3c4cbede/LineVul-A-Transformer-based-Line-Level-Vulnerability-Prediction.pdf
CommitBART: https://arxiv.org/pdf/2208.08100
GAPGen: https://arxiv.org/pdf/2201.08810
El-CodeBERT: https://dl.acm.org/doi/pdf/10.1145/3545258.3545260?casa_token=DNyXQpkP69MAAAAA:y2iJC3RliEh7yJ6SzRpRRKrzPn2Q6w25vpm5vpoN0TksDh_SbmVfa_8JcDxvVN8FydOL_vTJqH-6OA
COCLUBERT: https://ieeexplore.ieee.org/iel7/9679834/9679948/09680081.pdf?casa_token=FtrqlHTmm74AAAAA:kkMyRsMl9xqPQQSBTRd6vFD-2-DyVSomYBYqm8u8aKs7B0_rkYYfL_OLVmOHgzn1-vqMF6W7pM8
Xcode: https://dl.acm.org/doi/pdf/10.1145/3506696?casa_token=5H8iW3e2GlYAAAAA:m2QA-DXSk5LZYazFxDPEVfLZcYREqDomXNg5YmkR-rPllHD37Qd8eLw_SCu6rbhNHZJ2Od24dvJt_Q
CobolBERT: https://arxiv.org/pdf/2201.09448
SiamBERT: https://melqkiades.github.io/files/download/papers/siambert-sais-2022.pdf
CodeReviewer: https://arxiv.org/pdf/2203.09095
CodeBERT-nt: https://arxiv.org/pdf/2208.06042
BashExplainer: https://arxiv.org/pdf/2206.13325

Machine learning in compilers???

BenchPress: A Deep Active Benchmark Generator (arxiv.org)

To be honest, I did not expect machine learning to be part of a compiler… I’ve done programming since I was 13, understood compilers during my second year at the university and even wrote one (well, without any ML, that is).

Why would a compiler need machine learning, I wondered. It’s a pretty simple program – it takes a grammar, then parses the source code and translates it to machine code (or some other low-level representation). It has to be deterministic, as the same program cannot compile to two different machine codes. It’s just the way it is….

It turns out that machine learning is used in modern compilers to perform optimizations. The optimizations are done to take advantage of modern processors, their registers and large instruction sets. These optimizations are meant to make the machine code more parallel, allowing modern multi-core, multi-threaded processors to utilize every little bit of energy in all their cores.

In this paper, the authors use language models like BERT to create a benchmark that will allow different optimization techniques to be compared. This means that the same compiler can test itself against these benchmarks in order to find the best possible solution. Clever!

However, this is it from me. I’m not planning on writing a compiler, let alone an optimizer. I may use BERT models in the future for generation of programs, but I will most probably end there. But, in case you wonder – there is ML in compilers 🙂

Language models in Software Engineering (new paper review)

Article available at: https://arxiv.org/pdf/2205.11739.pdf

It’s no secret that I’ve been fascinated by modern, BERT-like language models. I’ve seen what they can do and how they operate, and I use them in two of my research projects. So, when this paper came around, I read it right away.

It’s a paper that gives an overview of the kinds of tasks that language models are used for in software engineering today. The list is long and contains a variety of tasks, e.g., code-to-code retrieval, repairing of source code or bug finding/fixing. In total, there are a lot of these tasks, but, IMHO, they are a bit low-level. There are no tasks that attempt to understand code at the design level, for example whether we can really see a specific design in the code.

The paper also shows which models are used, and provides references to these models. They list 20 models, with the tasks for which they were trained, including the datasets that they were trained on. Fantastic!

I need to dive deeper into these models, but I’m super happy about the fact that there is a list of these models now and that language technology now makes up a significant body of work in software engineering.

Automating the Measurement of Heterogeneous Chatbot Designs (paper review)

Paper from: http://miso.es/pubs/ACMSAC_2022.pdf

Using chatbots has gained importance in recent years, which has resulted in development of several chatbot platforms (like Amazon Lex, Google DialogFlow or IBM Watson). However, there is a limited number of studies related to quality assurance of chatbots. The paper from Pablo C. Cañizares, Sara Pérez-Soler, Esther Guerra and Juan de Lara addresses just this problem – how to guide testing of chatbots by using design metrics.

The paper proposes six global metrics (e.g., number of intents of the bot), eight intent metrics (e.g., number of training phrases per intent), three entity metrics (e.g., word length), and three flow metrics (e.g., conversation length). By measuring the values for these metrics, software designers of chatbots can predict three usability types – effectiveness, efficiency and satisfaction. To support the measurement process, the paper proposes a tool, available on GitHub, which can be used by practitioners. For some of the metrics, the tool employs machine learning and natural language processing. The metrics and the tool are evaluated on twelve chatbot designs. The tool could identify quality issues in terms of readability, conversation complexity, user experience and bot understanding. This demonstrates the usefulness of the tool in practice and how these metrics can help software developers in designing high-quality bots.

The metrics from the paper are:

INT – # intents
ENT – # user-defined entities
FLOW – # conversation entry points
PATH – # different conversation flow paths
CNF – # confusing phrases
SNT – # positive, neutral, negative output phrases
TPI – # training phrases per intent
WPTP – # words per training phrase
VPTP – # verbs per training phrase
PPTP – # parameters per training phrase
WPOP – # words per output phrase
VPOP – # verbs per output phrase
CPOP – # characters per output phrase
READ – reading time of the output phrases
LPE – # literals per entity
SPL – # synonyms per literal
WL – word length
FACT – # actions per flow
FPATH – # conversation flow paths
CL – conversation length
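A few of these metrics are easy to compute by hand, which gives a feeling for what they capture. Below is a minimal sketch for INT, TPI, WPTP, WPOP and CPOP. The chatbot design is a made-up dictionary (it is not the input format of the authors' tool), and averaging over all phrases of the bot is my assumption; the paper may aggregate differently.

```python
from statistics import mean

# Hypothetical chatbot design: intent name -> training and output phrases.
design = {
    "greet": {
        "training_phrases": ["hello", "hi there", "good morning"],
        "output_phrases": ["Hello! How can I help you?"],
    },
    "order_pizza": {
        "training_phrases": ["I want a pizza", "order a large pizza please"],
        "output_phrases": ["Which size would you like?", "Got it."],
    },
}


def chatbot_metrics(design):
    """Compute a handful of the design metrics for one chatbot."""
    training = [p for intent in design.values() for p in intent["training_phrases"]]
    outputs = [p for intent in design.values() for p in intent["output_phrases"]]
    return {
        "INT": len(design),                                          # number of intents
        "TPI": mean(len(i["training_phrases"]) for i in design.values()),
        "WPTP": mean(len(p.split()) for p in training),              # words per training phrase
        "WPOP": mean(len(p.split()) for p in outputs),               # words per output phrase
        "CPOP": mean(len(p) for p in outputs),                       # characters per output phrase
    }


for name, value in chatbot_metrics(design).items():
    print(f"{name}: {value:.2f}")
```

Metrics such as VPTP or SNT need part-of-speech tagging or sentiment analysis, which is where the NLP and machine learning parts of the tool come in.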

I will try to use these metrics if I write a chatbot 🙂

Testing of ML systems

Image by OpenClipart-Vectors from Pixabay

Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

Machine learning systems are very popular today, at least when it comes to research applications. They are not as popular as one would wish (or like) in real applications. One of the reasons is the fact that they are hard to test. We do not know how to check if an algorithm will behave as expected in all similar situations – well, we do not know which situations are similar for us and for the ML system.

This paper looks at the problem from a different angle. The research question is: RQ: What are simple and generic software tests that are capable of finding bugs and improving the quality of machine learning algorithms?

The authors developed a set of smoke tests, which they argue all ML algorithms should pass. The paper is quite exhaustive and if you are interested, I recommend taking a look at this table:

Table 1 | Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink
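To show what a smoke test of this kind can look like, here is a small sketch in plain Python. The toy NearestCentroid classifier, the smoke_test helper and the four input cases are my own illustration of the idea (feed extreme but valid data and check that training and prediction do not crash); they are not copied from the table in the paper.

```python
import math


class NearestCentroid:
    """Toy classifier that serves only as the system under test."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for k, value in enumerate(row):
                acc[k] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            # Pick the label whose centroid is closest to the row.
            best = min(self.centroids_, key=lambda lab: math.dist(row, self.centroids_[lab]))
            predictions.append(best)
        return predictions


def smoke_test(make_model, X, y):
    """Pass if fit/predict finish and every prediction is a known label."""
    try:
        predictions = make_model().fit(X, y).predict(X)
    except Exception:
        return False
    return len(predictions) == len(X) and set(predictions) <= set(y)


# Extreme but valid inputs (illustrative only).
CASES = {
    "all zeros": ([[0.0, 0.0]] * 4, [0, 1, 0, 1]),
    "very large values": ([[1e150, -1e150], [-1e150, 1e150]] * 2, [0, 1, 0, 1]),
    "single class": ([[1.0, 2.0], [3.0, 4.0]], [1, 1]),
    "one feature": ([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]),
}

results = {name: smoke_test(NearestCentroid, X, y) for name, (X, y) in CASES.items()}
for name, passed in results.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```

The tests say nothing about how accurate the model is. They only check that it survives inputs that a real dataset may contain, which is exactly the point of a smoke test.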

I love the article. It is simple, to the point and very applied. I’m going to use that in my tests of ML algorithms in the future.

How good are language models for source code tasks?

https://ieeexplore-ieee-org.ezproxy.ub.gu.se/document/9653849

The use of machine learning, and deep learning in particular, for software engineering tasks has exploded recently. I would say that it exploded a bit too much. I’m partly to blame here myself, as our team was one of the early adopters with the CCFlex model and source code analysis.

Well, this paper compares a number of modern deep learning models, so-called transformers, in various code and comment analysis tasks. The authors did a great job in collecting a set of models and datasets, training the models and critically evaluating their performance.

I recommend reading the entire paper, but what they found was a bit surprising to me. First of all, they found that the transformer models are better for natural language and not so great for source code analysis. The hypothesis is that the structure of programs is important here. They have also found that pre-training is important, but not crucial. Pre-training contributes only a moderate effect in the end. The dataset, and its content, is much more important for the task at hand.

This is a great paper and I hope that it can become essential reading for software engineers who build AI systems that support software engineering tasks.

autoML – let’s talk about it…

AutoML, a promise of green pastures, less work, optimal results. So, is it like that? In this post I share my view on this and my experience from running the first test using it.

First of all, let’s be honest, there is no such thing as a free lunch. In the case of autoML (auto-sklearn), the price tag comes first with the effort, skills and time to install it and make it work. The second is the performance…. It’s painfully slow compared to your own models, simply because it tests a lot of models here and there. It also takes a lot of time to download and to make it work.

But, first things first, let me tell you where I started. I used the data from the MicroHRV project ( 3. MicroHRV: Recognizing Rare Events in Microwave Radio Links and Intensive Care Units using Machine Learning – Software Center (software-center.se)). The data is from patients being operated on to remove blood clots from the brain (although it may sound dangerous, the actual procedure is planned and calm). I wanted to check whether autoML can do better than what we have at the moment.

What we have at the moment (for that particular dataset) is: Accuracy: 0.98, Precision: 0.98, Recall: 0.98 – using a Random Forest classifier. So, this is actually already very good. For the medical domain, that’s actually in a class of its own, given that our previous studies ended up with ca. 0.7 in accuracy at best.

When it comes to installing autoML – if you like Stack Overflow, downgrading, upgrading, compiling, etc. and run Windows 10, then it’s your heaven. If you run Linux – no problems. Otherwise – stick to manual analyses :)

After two days (and nights) of trying, the best configuration was:

WSL – Windows Subsystem for Linux
Ubuntu 20, and
countless OSS libraries

It takes a while to get it to work, the question is whether the results are good enough…

After three hours of waiting, a lot of heat from my laptop, and over 1,000 models tested, the result was Accuracy: 0.91, Precision: 0.94, Recall: 0.91

So, worse than my manual selection of models. I include the confusion matrices.

The matrices are not that different, as the validation sets are not that large either. However, it seems that the RF is still better than the best model from autoML.
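For readers who want to go from a confusion matrix to the three numbers, here is a minimal sketch. The matrix below is made up for illustration and is not the data from the project. I also assume support-weighted averaging over the classes; with that kind of averaging the recall is always equal to the accuracy, which matches the pattern in the numbers above, but that is my guess about how they were computed.

```python
def summarize(cm):
    """Accuracy plus support-weighted precision and recall.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total

    precision = 0.0
    recall = 0.0
    for i in range(n_classes):
        support = sum(cm[i])                   # samples that truly are class i
        predicted = sum(row[i] for row in cm)  # samples predicted as class i
        weight = support / total
        if predicted:
            precision += weight * cm[i][i] / predicted
        if support:
            recall += weight * cm[i][i] / support
    return accuracy, precision, recall


# Made-up, imbalanced two-class example (rows: true class, columns: predicted).
cm = [
    [80, 10],
    [5, 5],
]

accuracy, precision, recall = summarize(cm)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")
```

With small validation sets, a handful of samples moving between cells changes these numbers quite a bit, so the matrices themselves are worth a look before drawing conclusions.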

I need to work more on that and see if I am doing something wrong. However, I take this as a success – I’m better than autoML (still some use for an old professor) – instead of a let-down of not getting better results.

At the end of the day, 0.98 in accuracy is still very good!