Evaluating ML pipelines for real – spoiler alert: another pipeline (article review)

Evaluating classifiers in SE research: the ECSER pipeline and two replication studies (springer.com)

BIld av paula bassi från Pixabay

One of the most prominent problems with using research results in practice is the lack of replication packages, but this is far from being the only one. Another one, maybe an equally important problem, is the fact that the studies report performance in many different ways.

Since I have a chance to work with colleagues in medicine, I got to learn about their publication culture. It is more advanced than ours (software engineering), but that’s not the point. The main point is that they actually have guidelines on how to report ML studies. Here is an example of such a guideline: Clinician checklist for assessing suitability of machine learning applications in healthcare – PMC (nih.gov)

The paper that I wish to bring up is trying to address a similar aspect of software engineering. The paper reviews existing studies that provide recommendations, e.g., to report confusion matrices or to report statistical significance tests. Then it reviews some of the papers published in respected venues and then it provides actionable guidelines on how to evaluate the performance of machine learning models.

50 Language/Code models, let’s talk…

BIld av pencil parker från Pixabay

As you have probably observed I’ve been into language models for code analysis, design and recognition. It’s a great way of spending your research time as it gives you the possibility to understand how we program and understand how to model that. In my personal case, this is a great complement to the empirical software engineering research that I do otherwise.

In the recent time I got a feeling that I look into more and more of these models, all of them baring certain similarity to the Google’s BERT model or the Fracebook’s TransCoder. So I set off to do a short review of the papers that actually talk about code models or, as they are often called, programming language models. I started from the paper describing CodeBERT ( [2002.08155] CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arxiv.org) ) and looked at the 500 citations that the model has. The list below is just the list of the models that are created based on CodeBERT. There are also models created based on AlphaGo or Github CoPilot, but I leave these for another occasion.

I must admit that I did not read all of these papers and did not look at all of these models. Far from it, I only looked at some of them. My conclusion is that we have a lot of models, but the quality of the results vary a lot. The best models provide good results in ca. 20% of cases. AlphaCode is an example of such a model, which is fantastic, but not super-accurate all the time. As the model is used for super-competitive tasks, 20% is actually very impressive – it’s difficult to say that I would do better for these programming competitions, so I’m not criticizing.

The best model I’ve seen so far, however, is the Github CoPilot, which is by far the best model to create code that the world has seen. Well, there may be models that the world has not seen, but then they do not count. If you would like to see a preview of how I use it (part I), you can take a look at this video:

I sincerely hope that you find this list useful and that you can help me to keep it updated – drop me an e-mail about the list if you want to:

  1. AlphaGo: https://www.deepmind.com/blog/competitive-programming-with-alphacode
  2. TransCoder: https://github.com/facebookresearch/TransCoder
  3. CodeT5: https://arxiv.org/pdf/2109.00859 
  4. CodeITT5: https://arxiv.org/pdf/2208.05446 
  5. ProphetNet: https://arxiv.org/pdf/2104.08006 
  6. Cotex: https://arxiv.org/pdf/2105.08645 
  7. Commit2vec: https://arxiv.org/pdf/1911.07605 
  8. CoreGen: https://www.sciencedirect.com/science/article/pii/S092523122100792X  
  9. SyncoBERT: https://arxiv.org/pdf/2108.04556 
  10. TreeBERT: https://proceedings.mlr.press/v161/jiang21a/jiang21a.pdf 
  11. FastSpec: https://ieeexplore.ieee.org/iel7/9581154/9581061/09581258.pdf 
  12. CVEFixes: https://dl.acm.org/doi/pdf/10.1145/3475960.3475985 
  13. CodeNet: https://arxiv.org/pdf/2105.12655
  14. Graph4Code: https://www.researchgate.net/profile/Jamie-Mccusker-2/publication/339445570_Graph4Code_A_Machine_Interpretable_Knowledge_Graph_for_Code/links/5fd2a29a45851568d154cfaa/Graph4Code-A-Machine-Interpretable-Knowledge-Graph-for-Code.pdf 
  15. DeGraphCE: https://dl.acm.org/doi/pdf/10.1145/3546066 
  16. VELVET: https://ieeexplore.ieee.org/iel7/9825713/9825693/09825786.pdf
  17. Code2Vec: https://uwspace.uwaterloo.ca/bitstream/handle/10012/15862/Arumugam_Lakshmanan.pdf?sequence=9&isAllowed=y 
  18. MulCode: https://ieeexplore.ieee.org/iel7/9425868/9425874/09426045.pdf 
  19. Flakify: https://ieeexplore.ieee.org/iel7/32/4359463/09866550.pdf 
  20. CoDesc: https://arxiv.org/pdf/2105.14220 
  21. NatGen: https://arxiv.org/pdf/2206.07585 
  22. Coctail: https://arxiv.org/pdf/2106.05345 
  23. MergeBERT: https://arxiv.org/pdf/2109.00084 
  24. SPTCode: https://dl.acm.org/doi/pdf/10.1145/3510003.3510096 
  25. InCoder: https://arxiv.org/pdf/2204.05999 
  26. JavaBERT: https://ieeexplore.ieee.org/iel7/9680270/9679822/09680322.pdf 
  27. BERT2Code: https://arxiv.org/pdf/2104.08017 
  28. NeuralCC: https://arxiv.org/pdf/2012.03225 
  29. LineVD: https://arxiv.org/pdf/2203.05181 
  30. GraphCode2Vec: https://arxiv.org/pdf/2112.01218 
  31. ASTBERT: https://arxiv.org/pdf/2201.07984 
  32. CodeRL: https://arxiv.org/pdf/2207.01780 
  33. CV4Code: https://arxiv.org/pdf/2205.08585 
  34. NaturalCC: https://xcodemind.github.io/papers/icse22_naturalcc_camera_submitted.pdf 
  35. StructCode: https://arxiv.org/pdf/2206.05239   
  36. VulBERT: https://arxiv.org/pdf/2205.12424 
  37. CodeMVP: https://arxiv.org/pdf/2205.02029 
  38. miBERT: https://ieeexplore.ieee.org/iel7/9787917/9787918/09787973.pdf?casa_token=rPNbu-k9Gh4AAAAA:3lkZVyUjnDP4Sp1UmmO9eVftsRaf1zAuw1YhHQogsyDBE2Y7992gBlhPb9jKVcI-5Q8tTv2JEyQ 
  39. LineVUL: https://www.researchgate.net/profile/Chakkrit-Tantithamthavorn/publication/359402890_LineVul_A_Transformer-based_Line-Level_Vulnerability_Prediction/links/623ee3d48068956f3c4cbede/LineVul-A-Transformer-based-Line-Level-Vulnerability-Prediction.pdf 
  40. CommitBART: https://arxiv.org/pdf/2208.08100 
  41. GAPGen: https://arxiv.org/pdf/2201.08810 
  42. El-CodeBERT: https://dl.acm.org/doi/pdf/10.1145/3545258.3545260?casa_token=DNyXQpkP69MAAAAA:y2iJC3RliEh7yJ6SzRpRRKrzPn2Q6w25vpm5vpoN0TksDh_SbmVfa_8JcDxvVN8FydOL_vTJqH-6OA 
  43. COCLUBERT: https://ieeexplore.ieee.org/iel7/9679834/9679948/09680081.pdf?casa_token=FtrqlHTmm74AAAAA:kkMyRsMl9xqPQQSBTRd6vFD-2-DyVSomYBYqm8u8aKs7B0_rkYYfL_OLVmOHgzn1-vqMF6W7pM8 
  44. Xcode: https://dl.acm.org/doi/pdf/10.1145/3506696?casa_token=5H8iW3e2GlYAAAAA:m2QA-DXSk5LZYazFxDPEVfLZcYREqDomXNg5YmkR-rPllHD37Qd8eLw_SCu6rbhNHZJ2Od24dvJt_Q 
  45. CobolBERT: https://arxiv.org/pdf/2201.09448 
  46. SiamBERT: https://melqkiades.github.io/files/download/papers/siambert-sais-2022.pdf 
  47. CodeReviewer: https://arxiv.org/pdf/2203.09095 
  48. CodeBERT-nt: https://arxiv.org/pdf/2208.06042 
  49. BashExplainer: https://arxiv.org/pdf/2206.13325

So, you want to automate your security assessment (beyond pentesting)…

BIld av Darwin Laganzon från Pixabay

Automatic Security Assessment of GitHub Actions Workflows (arxiv.org)

After my last post, and the visit to the workshop at MDU, I realized that there are a few tools that can be used automatically already now. So, this paper presents one of them.

What is interesting about this tool is that it uses github workflows, so it’s compatible with many modern CI/CD pipelines. The tool analyzes worflows and looks for security vulnerabilities. For example, if you keep sensitive information in a plain text file that is used in the workflow (secrets), or checks if the workflow enforces the “least privilege” principle.

The implementation of the tool is OSS; can be found on github here: Mobile-IoT-Security-Lab/GHAST: GitHub Actions Security Tester

I need to test it as it looks very interesting. Maybe I can use this tool on some of the company’s workflows to test their exploitability score?

What are code reviews really good for?

Visualization of a source code of one module from the Cloudera projects. The embeddings are taken from our team’s neural network. t-SNE is a visualization technique taken from bioinformatics.

Concerns identified in code review: A fine-grained, faceted classification – ScienceDirect

Code reviews are time consuming. And effort intensive. And boring. And needed. Depending whom we ask, we get one of the above answers (well, 80% of the time). The reality is that the code reviews are not the most productive activity. Reading the code and looking for defects is good when we do it once, but when we need to work with it during continuous integration, the story changes. It becomes like studying for the exam or the homework – we do everything else to postpone it. Then someone waits longer or the code quality suffers.

There has been a lot of work done to make this activity more fun – gamification, automated support, using machine learning to filter out the code that we can automatically check – just to name the few. As far as I know, there has not been much work in understanding of what kind of problems code reviews really find.

In this article, the authors address that very question. Admittedly, they only analyzed 7 OSS projects, but their results are still interesting: “We identified 116 defect types that we grouped into 15 groups to create a defect classification. Additionally, 38% of these defects could be automatically detected accurately.

So, that basically means that 38% of defects could be identified by using testing or static analysis (or some other fancy automation technique). This figure summarizes their results (this is a link to the figure in sciencedirect): https://ars.els-cdn.com/content/image/1-s2.0-S0950584922001653-gr5_lrg.jpg

So, what the code reviews are good for? Here is their list:

  • threads,
  • header comments,
  • errors, warnings and logging,
  • test cases,
  • annotations,
  • performance,
  • identifier naming,
  • modifiers,
  • comments,
  • javadoc,
  • design,
  • implementation, and
  • logic and functionality

The list is sorted from the least frequent to the most frequent – so logic and functionality is the category where the code reviews are the most useful for. I need to also say that the frequencies are not super-high – threading is only 1 detected concern, while logic and functionality has 57. So, you know, could be more, given how much time is spent on code reviews. I guess it is what the quality costs nowadays, even though there is no real data on this.

Machine learning in compilers???

BenchPress: A Deep Active Benchmark Generator (arxiv.org)

To be honest, I did not expect machine learning to be part of a compiler… I’ve done programming since I was 13, understood compilers during my second year at the university and even wrote one (well, without any ML, that is).

Why would a compiler need machine learning, I wondered. It’s a pretty simple program – it takes a grammar, then parses the source code and translates that to a machine code (or some other low level representation). It has to be deterministic as the same program cannot compile to two different machine codes. It’s just the way it is….

It turns out that machine learning is used in modern compilers to perform optimizations. The optimizations are done to take advantage of modern processors, their registers and long instructions sets. These optimizations are meant to support machine code in being more parallel, allowing the modern multi-core, multi-thread processors to utilize every little bit of energy in all their cores.

In this paper, the authors use language models like BERT to create a benchmark that will allow different optimization techniques to be compared. This means, that the same compiler, can test itself against these benchmarks in order to find the best possible solution. Clever!

However, this is it from me. I’m planning on writing a compiler, let alone an optimizer. I may use BERT models in the future for generation of programs, but I will most probably end there. But, in case you wonder – there is ML in compilers 🙂

Testing deep neural networks (article highlight)

A Probabilistic Framework for Mutation Testing in Deep Neural Networks (arxiv.org)

Testing of neural networks is still an open problem. Due to the complexity of their connections, and their probabilistic nature, it is difficult to find defects. Although there is a lot of approaches, e.g., using autoencoders or using surprise adequacy measures, testing of neural networks is something of a mistery for me.

I could say that the topic was under my radar for a while. I actually though that there is not much need for testing research in software engineering; even though I run two projects with the testing components. For one, I thought that deep learning is basically like a “rabbit hole” – the more you test it, the more interesting properties you discover. I’ve tried to use testing to understand what kind of things the models learn, but I’m not sure that this is the right approach. I’m affraid that this will never be the case – the deep learning models learn something, we can evaluate it, but we can never really fully understand what the models has learned.

Now, this article uses mutation testing for the purpose to find the best test suite to validate the models. Well, it does more than that. It offers a framework where we can use three different models to evaluate the mutants and choose the ones that are expected to provide the best results. It is built on top of frameworks/models like DeepCrime (link here) and can provide a better selection approach. So far, the framework has been evaluated on the standard dataset – MNIST – but I hope that it will be expanded on other datasets in the future.

Testing deep learning systems in automotive software – article highlight

BIld av Pexels från Pixabay

IEEE Xplore Full-Text PDF:

The summer is in full swing, and after a few weeks of leisure and relaxation, I’m back to work. In one of our research projects, we examine the ability to test deep learning systems for computer vision in autonomous drive systems. It’s been a challenge, as the field is rather scattered. There is a lot of work on testing DL systems but without the specifics of the safety or autonomous drive. At the same time, there are a lot of studies about testing autonomous systems – usually using simulations.

So, in this paper, the authors focus on using metamorphic testing to test DL networks. By manipulating the input images, they observe how the network reacts and what the predicted behavior is. This helps to establish some sort of boundaries regarding when the system is safe to operate and how it can behave in practice. It allows an understanding of which neurons were essentially activated in the network (which is not the same as network coverage).

The paper presents a tool for that purpose, which is something that I really need to try on our autoencoders from the DeVELOP project.

Automating the Measurement of Heterogeneous Chatbot Designs (paper review)

Image by NPXL_Studio from Pixabay

Paper from: http://miso.es/pubs/ACMSAC_2022.pdf

Using chatbots has gained importance in recent years, which has resulted in development of several chatbot platforms (like Amazon Lex, Google DialogFlow or IBM Watson). However, there is a limited number of studies related to quality assurance of chatbots. The paper from Pablo C. Cañizares, Sara Pérez-Soler, Esther Guerra and Juan de Lara addresses just this problem – how to guide testing of chatbots by using design metrics.

The paper proposes six global metrics (e.g., number of intents of the bot), eight intent metrics (e.g., number of training phrases per intent), three entity metrics (e.g., word length), and three flow metrics (e.g., conversation length). By measuring the values for these metrics, software designers of chatbots can predict three usability types – effectiveness, efficiency and satisfaction. To support the measurement process, the paper proposes a tool, available on GitHub, which can be used by practitioners. For some of the metrics, the tool employs machine learning and natural language processing. The metrics and the tool are evaluated on twelve chatbot designs. The tool could identify quality issues in terms of readability, conversation complexity, user experience and bot understanding. This demonstrates the usefulness of the tool in practice and how these metrics can help software developers in designing high-quality bots.

The metrics from the paper are:

  • INT – # intents
  • ENT – # user-defined entities
  • FLOW – # conversation entry points
  • PATH – # different conversation flow paths
  • CNF – # confusing phrases
  • SNT – # positive, neutral, negative output phrases
  • TPI – # training phrases per intent
  • WPTP – # words per training phrase
  • VPTP – # verbs per training phrase
  • PPTP – # parameters per training phrase
  • WPOP – # words per output phrase
  • VPOP – # verbs per output phrase
  • CPOP – # characters per output phrase
  • READ – reading time of the output phrases
  • LPE – # literals per entity
  • SPL – # synonyms per literal
  • WL – word length
  • FACT – # actions per flow
  • FPATH – # conversation flow paths
  • CL – conversation length

I will try to use these metrics if I write chatbot 🙂

What will shape the future of automotive software (engineering)?

Image by Jordan Holiday from Pixabay

Based on the following article + my own thoughts: D08042936-with-cover-page-v2.pdf (d1wqtxts1xzle7.cloudfront.net)

It’s been a while since I’ve written about automotive software, but that does not mean that nothing happened. During the pandemic, the car manufacturers suffered great losses caused by the global shortage of silicon, lack of workforce due to lockdowns and the overall slowdown of development due to the WFH situation.

There are a few trends that shape and will continue to shape the automotive sector. The first one is electrification – as the world is going away from fossil fuels, more cars will need to use electricity. For the software part, this means that there will be fewer components to steer the powertrain, fewer communication buses, and lower complexity. This means that we have some spare computing power for more advanced functionality.

Now, this advanced functionality can come from autonomous driving, which is still an important trend. However, it may also come from increased connectivity and an increased number of smart functions (the ones using machine learning). The increased ability to develop software that utilizes this new power will decide whether a given car is popular or not. By the end of the day, the consumers do not want to have boring cars with bare-minimum functionality. Cars are great, they need to be driven and their driving needs to be fun!

The last trend is the ability to utilize cooperative driving (which the article at the top tackles). To make things work smoothly, we need to coordinate. We can save fuel/energy if we calculate the exact time for one bus to arrive and the next one to leave – that requires coordination. The same goes for trucks, taxis, etc. This increased cooperative driving will also increase the complexity of software and put more requirements on the dependability of it – as one failure can propagate longer than before.

Do explicit review strategies improve code review performance?

Image by Pixabay

Do explicit review strategies improve code review performance? Towards understanding the role of cognitive load | SpringerLink

I’ve written a lot about code reviews and I’ve done my share of experimentation in software engineering. When I started my career, using inspections (like Fagan-style code inspections) was the primary source of experimentation. It was how I learned to experiment, although I never experimented with code inspections.

So, when getting my hands on this article, I thought that this is just one of the same, but in a different context – whether guided reading actually improves effectiveness and efficiency of code review processes. The effectiveness and efficiency are measured in the standard way – using defects as the output of the review process. But, there is something new with this study.

First of all, this is a study done with professional developers. The authors have designed an experiment and employed professional, though junior, developers to conduct it. Second of all, this is an experiment in the context of modern code reviews (Git, Gerrit, that sort of thing). Third, the results are not that convincing any more.

I encourage you to read the entire paper, but let’s dive a bit deeper into some of the results. For example, the experiment found that it is not always the case that guidance is better. It provides more cognitive load (the reviewers need to understand the guidance as well as the code), and it can be downright misleading. It pays off for longer and more complex code fragments.

The experiment also found that the complexity of the actual guidance (checklist) plays an important role – shorter, less cognitively demanding lists, are preferred. This is an important finding as, to my best knowledge, no one has ever said that. Checklists and perspective-based reading techniques assumed that more extra information equals better results. This experiment says that a well-balanced information is better than more information. I know, seems kind of obvious when you think of that, but it was not really considered up until now.

Finally, the most significant factor, found in this experiment, was that it is the understanding of the code that makes a review better or worse, not the guidance. At least not the guidance on a general level (like “Are all data types declared correctly?”).

What I make out of that is that there is nothing that substitutes knowledge. If you want to get something done, you need to put the hours into this.

I know, kids may not like it….