Testing of ML systems

BIld av OpenClipart-Vectors från Pixabay

Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

Machine learning systems are very popular today, at least when it comes to research applications. They are not as popular as one would wished (or liked) in the real applications. One of the reasons is the fact that they are hard to test. We do not know how to check if an algorithm will behave as expected in all similar situations – well, we do not know which situations are similar for us and for the ML system.

This paper looks at the problem from a different angle. The research question is: RQ: What are simple and generic software tests that are capable of finding bugs and improving the quality of machine learning algorithms?

The authors developed a set of smoke tests, which they see that all ML algorithms should pass. The paper is quite exhaustive and if you are interested, I recommend to take a look at this table:

Table 1 | Smoke testing for machine learning: simple tests to discover severe bugs | SpringerLink

I love the article. It is simple, to the point and very applied. I’m going to use that in my tests of ML algorithms in the future.

How good are language models for source code tasks?

https://ieeexplore-ieee-org.ezproxy.ub.gu.se/document/9653849

Using machine learning, and deep learning in particular, for software engineering tasks exploded recently. I would say that it exploded a bit too much. I’m myself to blame here as our team was one of the early adopters with the CCFlex model and source code analysis.

Well, this paper compares a number of modern deep learning models, so called transformers, in various code and comment analysis tasks. The authors did a great job in collecting a set of models and datasets, trained them and critically evaluated the performance.

I recommend reading the entire paper, but what they found was a bit surprised for me. First of all, they found that the transformer models are better for the natural language and not so great for the source code analysis. The hypothesis is that the structure of programs is important here. They have also found that pre-training is important, but not crucial. Pre-training attributes to a moderate effect in the end. The dataset, and its content, is much more important for the task at hand.

This is a great paper and I hope that this can become an essential reading for software engineers working with AI systems engineering supporting the software engineering tasks.

Review4Repair – article review

https://www.sciencedirect.com/science/article/pii/S0950584921002111?casa_token=D32qAALSwPQAAAAA:2Vm5zjUzFncLhQ_eZTUueefRqllb8fwBEnfWJfbNO_TYMcIjuumTBL9wWCLkscBko53FPR5cRQ

Reviewing source code is something that I talked about a few times already. It is an activity which is almost as old as software engineering itself. Back in the beginning, this activity was done manually just before the release, as a complement to testing.

Then came google with their software engineering practices, something that is known today as “shift left”, meaning that software quality assurance activities should be done close to when the source code is actually developed. Then came Microsoft with their “Modern Code Reviews” that advocated code reviews before actually committing the source code to the main branch.

Now, we are pretty good at reviewing source code. It is an activity which is done rather fast. As the amount of data from code reviews grows, we are getting more eager to try to use AI and ML for this task. This article is a very good example of that. The authors leveraged on the ability to use seq2seq code summarization techniques to match source code with comments and then with code repair suggestions. The results are promising and show that they are able to provide a relevant suggestion in 1 out of 5 cases. One out of three if we consider top 10 suggestions.

I’ve read this article with huge interest and I will try this myself. All data and code is available publicly, which allows to play around with this technology on your own system. Spoiler alert, though, that training takes one week per pass on a TPU.

Reviewing rounds prediction for code patches

Image by pixabay

Reviewing rounds prediction for code patches | SpringerLink

Understanding how reviews of source code are done seem to be one of my main interests recently. Partly because reviews are important for software quality, while taking time. Partly, also, because I think it is interesting to check if we can quantify a good opinion from a good software developer.

In this paper, the authors study how to predict to which degree one can predict how many comments a given patch will have. Now, this problem may not be the most exciting ones, but it attracted my attention because of the fact that the authors studied the same projects as we did. However, to the contrary of our work, they also take into consideration features that characterize software developer networks – for example the experience in commenting software patches or the networking.

Now, to the results. The models presented in this paper seem to be quite good in quantifying and predicting patches within the same project – all kinds of predictions have pretty good F1-scores, above 65%. This means that we can train these models for our own projects and to be able to predict whether a particular patch will be commented on once or twice, or even many times.

The performance of the model on the cross-projects dataset. There, the performance is ok for predicting whether a particular patch will be commented on. Predicting how many times the patch will be commented on, or even that it will be commented on many times, does not work very well. The magnitude of the performance measures oscillates close to the 0% mark, which means that the models are not better than just guessing. I guess, you cannot have it all… from one model.

To sum up, the reason I read this article in more detail than others, was essentially not the performance, but tryin to understand the underlying techniques which they use. I’d like to say that they use a good set of features, which I recommend other to use (and will definitely use myself in the next studies), and the fact that they use simple language models, like word2vec, to understand the programming language. What I lack, though, is scrutinizing whether there is a statistically significant dependency between the sentiment (or even its strength) and the length of the discussion.

Rationality by Steven Pinker

Rationality: What It Is, Why It Seems Scarce, Why It Matters : Pinker, Steven: Amazon.se: Böcker

I’ve read this book because I wanted to get some inspiration from social sciences. What I ended up with is a book about Bayesian statistics and its consequences. To some extent, I’m happy to have read it, because I got a better view of how to use Bayesian statistics in practice. Yet, I am a bit disappointed. Not about the statistics, but about the book.

A while back I read the book by Judea Pearl about Bayesian networks and this role in statistics. That book was about something new and a bit fresh. It got my interest and I felt refreshed after I read it. This book was a bit too much of a repetition. Don’t get me wrong, I like repetition and I like the style of Steven Pinker (he is one of my role models when it comes to academic writing).

The book is about making inferences by calculating probabilities using Bayes theorem. It explain them very well and shows how they are used and misused in modern society – from politics to medical diagnoses. The book shows a few tricks to make better decisions and inferences.

I guess I need to read it once more in order to get a better taste of its distinct flavor.

A Thousands Brain theory, book review

Image by Pixabay

A Thousand Brains: A New Theory of Intelligence : Hawkins, Jeff: Amazon.se: Böcker

This is a new book, written by one of the people behind Palm Pilot, who is both an engineer and a neuroscientist. The book proposes a theory on how to describe neocortex and its functions.

Why is neocortex so important, one may ask. It holds our intelligence and our consciousness. Some would say that it is the place which defines us as humans, which allows us to be aware and intelligent.

The interesting part of this book is the fact that it attempts to provide guidelines on the future of machine intelligence and machine learning. It shows different paths to achieve AGI (Artificial General Intelligence): either as developing a lot of specialized models and winding them together, or as making one large model for everything.

These two approaches are already present in the modern AI community. The latter one (large model for everything) can be seen in the work of OpenAI and GPT-3. The scientists behind that model train it on large corpora of text, hoping that it can understand our natural language and execute our commands. Well, for now it is mostly about creating programs.

The first approach is generally the original idea of AI and ML. The original idea is about training models for specific tasks, such as image recognition, classification, text translation. This is where most of the current research lies and where we have observed the latest breakthroughs – AlphaGo, AlphaStar.

However, the thesis in the book is that the approach of one large model is more natural, similar to how our brain works. The theory of how neocortex stores, frames and recalls information is the core of what we need to achieve in order to make it work in practice.

Well, there is more to the book than I can write in a blog post, so I strongly recommend to read it and reflect on how we use ML and AI today. I’m going to try few of these ideas in 2022!

Death interrupted by Jose Saramago, book review

Only for those who love Terry Pratchett….

Every now and then I read what the Nobel prize winners in literature write. This year, I did read this novel by 1998 winner.

The novel is quite interesting and very well written. If you like Terry Pratchett, then you will find it very good. The synopsis is that one day, Death decides to take a break and do not visit England. It means that no one in England dies that day, and for a while onwards.

At first, this situation brings a lot of joy – people do not die in car accidents, from diseases, etc. After a while, however, this situation brings a few challenges.

First, the challenges for the healthcare – there is not enough room in hospitals anymore, as the very sick patients are in the wards for ever.

Then, the economical – the funeral entrepreneurs are getting out of business.

Finally, the mafia comes in and starts smuggling people to neighboring countries, where they instantly die. For a small fee, naturally.

Now, in the middle of the book, the death returns, but changes the way which she works. Instead of using the scythe to take lives, she uses it to send letters. The letters, nicely put in a violet envelope, have the same effect – the recipient dies.

Again, this sounds like a great idea, but depends on the post office (of course) and the timing of sending the letters, as well as the timing for receiving the letters.

The book ends with Death coming down to earth dressed as a woman, as she fell for a cellist. She watched him perform, visit him at his home, keep his pet on her laps.

I’m not going to spoil the book for you, so here is where I stop.

If you like the style of Terry Pratchett, this book is for you!

Legacy code…

I stumbled across a great talk from Dylan Beattie about legacy code. It is a pre-pandemic talk, but it opens up with a great song and talks about legacy code differently than what we usually do.

There is a lot of great material and food for thought in this video, but I would like to turn your attention to minute 26, where Dylan talks about Excel and how the world runs on it.

He says that a lot of things are actually built on top of Excel because it is essentially a functional language of sorts. The software developed on top of Excel is also the software that is NOT written by professional programmers and software engineers. Yet, it is prevalent in modern society.

Don’t get me wrong. I am in favor of Excel. Love the tool and what Microsoft has done with it. It is so flexible that it can be used with almost all programming environments – from the built-in VBA (I know, ancient history), to Python or C#. We’ve done our share of Excel programming back in a day, e.g. designed measurement systems based on it: A framework for developing measurement systems and its industrial evaluation – ScienceDirect

I agree, the tool is not perfect, but it is installed on ALL office computers and can be executed by anybody. Just open up the file and run it. That’s why we chose it for the measurement systems. Well, at least until we had to do a big rewrite and go to SQL, dashboards, etc…

As I said – history.

Predicting defects on the line level, article review

Image by pixabay

IEEE Xplore Full-Text PDF:

A lot has been written about defect prediction, and I’m pretty sure that a lot will be written. It’s one of the research areas which is quite cool to work with because it provides researchers with quite quick results and is relatively quantitative in its nature.

One could also say that this is a holy grail in software development – to predict a location of a defect and fix it before it becomes a problem. It’s a good goal, but it is also a goal that is more like quicksand than a gravel road. Well, for one, not all defects are easy to recognize. Some are not even certain to be defects – sometimes it is not clear how to interpret a requirement, so it’s not easy to say if a piece of code is implementing it correctly or not.

In this paper, the authors have done a great job in creating a system to predict defect location on line-level – DeepLineDP. The requirements for the system are partially based on a survey conducted by the authors with developers.

According to the authors: “DeepLineDP is 14%-24% more accurate than other file-level defect prediction approaches; is 50%-250% more cost-effective than other line-level defect prediction approaches; and achieves a reasonable performance when transferred to other software projects. These findings confirm that the surrounding tokens and surrounding lines should be considered to identify the fine-grained locations of defective files (i.e., defective lines). “

I like this work and I recommend everyone interested in how to use deep learning for code tasks to look at this work.

Our team has done some of these investigations ourselves. You can watch them on Youtube here:


Post-Corona branding…

Post Corona: From Crisis to Opportunity: Galloway, Scott: 9780593332214: Amazon.com: Books

A good holiday reading is something that is an essential addition to the time spent with family and friends. Every year I try to get hold of a good book to get inspiration for the upcoming year. Last year, I’ve read “Grit”, which is about perseverance. This year, I noticed a book of NYU Stern professor Scott Galloway – “Post Corona: From Crisis to Opportunity”. Galloway is also the author of “The Four”, which is a book about Apple, Google, Facebook, and Amazon.

Now, to the topic of the day – the post-corona book. I’ve read this book a bit slower than I usually do (which is a good thing). When I read it I had Galloway’s voice in my head talking about the opportunities of large companies – a big tech, as he calls them. His thesis is that the pandemic actually accelerated their growth to the size which makes them really hard to disrupt. By mergers and acquisitions, they can “cannibalize” their competition, unless the competition is them.

I thought that this book would be about Zoom – a company unheard of before the pandemic, now a synonym for a phone call. I thought that the book would be about health services and telemedicine – another area that was small and now is big. Now, it was nothing like that. The book was about The Four and how they capitalize on their brands in times of pandemic.

There is a thesis out there, that if you are getting something for free, it’s not worth much. In this book, Galloway popularizes another thesis – if you are getting something for free, you are the product, not the consumer. He uses this as a way of explaining why Apple charges so much for their products – for not using our data, whereas Google and Facebook/Meta capitalize on our data. Apple connects 200 data points per day from us, while Google collects 2000 data points per hour – a small difference.

I’m not a privacy freak, but I do not want to be a product unless I choose to. I do not want companies to monetize on me, my behavior, and my family. But, and that’s a sad thing, I do want to have great services for a reasonable price. I want my maps to work well – the one in the car’s GPS simply does not make it. I want to watch short tutorials on YouTube – Netflix does not produce tutorials about variational autoencoders (yet).

To sum up, I like the thesis posed by Galloway, that the next big thing taken up by Amazon, will probably be medical insurance or schools. It is not difficult to see that the telemedicine model is essentially mature enough for being disrupted. I really recommend this book as food for thought in the post-pandemic (or endemic) world of 2022.