Testing software systems is a costly task. As a former tester, I see it as a never-ending story – you’re done with your testing, new code is added, you’re not done anymore, you test more, you’re done, new code… and so on.
When I was a tester, there were no tools for automating the test process (we’re talking about the 1990s here). Well, OK, there was CppUnit, and it was a great help – I could create a suite and execute it. Then I needed to add new test cases, create functional tests, etc. It was fun, until it wasn’t anymore.
I would have given a lot for test orchestration tools back then. A lot has happened since. This paper presents a great overview of how testing cost is estimated – I know, it’s not orchestration, but hear me out. I like this paper because it also shows which tools are used, how test cost is estimated (e.g. based on metrics like coverage and effort) and how the tests are evaluated.
I recommend this as an overview, a starting point for understanding today’s testing processes and, eventually, for optimizing them based on the right premises (not the HiPPO – the highest paid person’s opinion).
Many companies talk about using AI in their software engineering processes. However, they have problems sharing their data with researchers and students. The legal processes around open-sourcing data were, and still are, scary, and setting up internal collaborations is time-consuming and therefore requires extra effort.
So, this is a great example of replicating some industrial set-ups in the open source community. I’ll use these data sets in my work and I’d love to see more initiatives like that.
Our team is working on one of those at the moment…
2020 was a year like no other. Everyone can agree on that. The pandemic changed our lives a lot – the pace of digitalization has gone from a tortoise to a SpaceX rocket!
For me, this year has also changed a lot of things. I’ve moved into the new field of medical signal analysis using ML. I realized that my skillset can be used to help people – maybe not those hit directly by the pandemic, but still people who need our help.
Together with a team of great specialists from Sahlgrenska University Hospital, we managed to create a set-up for collecting data in the operating room, tagging it and then, finally, applying ML.
In the last three months, we went from zero to three articles in the making, data collected from several patients, fantastic accuracy and a great deal of fun.
I’ve reflected upon this project and it’s probably the project I had the most fun with during 2020. A completely new set-up, a great team, tremendous energy in the work and a great deal of meaning behind it.
The project was partially sponsored by Chalmers CHAIR initiative. Thank you!
Machine learning and deep learning are only as good as the data used to train them. However, even the best data sources can yield data of non-optimal quality. Noise is one example of such data problems.
Our research team has studied the impact of noise on machine learning in software engineering – mostly on testing data. In this paper we present one technique to identify noise, measure it and reduce it. There are several ways to do this, but we use one of the more robust ones – removal of noise.
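To make the idea concrete, here is a minimal sketch of a removal-based noise filter – a generic classification filter, not the exact algorithm from our paper – assuming the data is available as NumPy arrays X (features) and y (labels):

```python
# A minimal sketch of removal-based noise filtering – a generic
# classification filter, not the exact algorithm from the paper.
# Assumes X (features) and y (labels) are NumPy arrays.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def remove_noisy_instances(X, y, n_splits=5):
    """Drop instances whose cross-validated prediction disagrees
    with their recorded label."""
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    # Predict each instance's label with a model trained on the other folds
    y_pred = cross_val_predict(clf, X, y, cv=n_splits)
    suspected_noise = y_pred != y                                         # identify
    print(f"Suspected noise: {suspected_noise.mean():.1%} of instances")  # measure
    return X[~suspected_noise], y[~suspected_noise]                       # reduce (remove)
```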
I recommend taking a look at how the algorithms work – and let us know if you find it interesting!
Code smells are quite interesting phenomena to study. They are not really defects, but they are not good code either. They exist, but people rarely want to admit to them. There is also no consensus on how much effort it takes to remove them (or even whether they should be removed or just avoided).
In this paper, the authors study whether it is possible to use ML to find code smells. It turns out it is, and the accuracy is quite high (over 95%). The paper also shows that it is sometimes better to show several recommendations (e.g. two potential smells) rather than one – it requires less accuracy from the model, but helps the users narrow down their solution spaces.
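The top-k idea is easy to sketch. Below is a hypothetical illustration (the smell labels, metrics and model are mine, not the paper’s) of returning the two most probable smells instead of a single verdict:

```python
# Hypothetical sketch: return the k most probable code smells for one
# code fragment, given a trained scikit-learn classifier over code
# metrics. Labels and numbers below are invented.
import numpy as np

def top_k_smells(model, code_metrics, k=2):
    """Return the k most probable smell labels with their probabilities."""
    probs = model.predict_proba([code_metrics])[0]
    ranked = np.argsort(probs)[::-1][:k]      # indices of the top-k classes
    return [(model.classes_[i], float(probs[i])) for i in ranked]

# Usage, assuming `clf` was trained on rows of code metrics:
#   top_k_smells(clf, [450, 37, 0.82], k=2)
#   -> e.g. [('god_class', 0.61), ('long_method', 0.24)]
```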
Data veracity is the degree to which data corresponds to the true values. The term comes from the metrological concept of “measurement trueness”, which is the degree to which a measurement quantifies the value correctly.
Well, that sounds very simple, but it is in fact quite complex. In our previous work, we scrutinized what it means to have veracious data in transport systems (https://ieeexplore.ieee.org/abstract/document/7535482). It turns out that “lying” is not the only option here.
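As a toy illustration of trueness (all numbers below are invented): a sensor can be very consistent yet systematically off, and it is that systematic offset – the bias – that trueness, and thus veracity, is about:

```python
# Toy example of trueness vs. precision – all numbers are invented.
import numpy as np

true_speed = 50.0                                    # reference value (km/h)
readings = np.array([52.1, 51.8, 52.3, 51.9, 52.0])  # repeated sensor readings

bias = readings.mean() - true_speed   # trueness: systematic error
spread = readings.std(ddof=1)         # precision: random error

print(f"bias = {bias:+.2f} km/h, spread = {spread:.2f} km/h")
# The sensor is precise (tiny spread) but not true (a persistent ~+2 km/h
# bias) – its data would score low on veracity despite looking consistent.
```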
In this book, the author looks into the ways in which things can be untrue. Sometimes deliberately, by lying; sometimes by mistake. Sometimes, as we learn in the last chapter (with the Brazilian aardvark), a mistake can actually end up being accepted as truth over time.
I recommend the book as it is written in a fantastic manner, providing examples from the real world (e.g. the alleged drone sightings over Gatwick in 2018). It even goes a bit further and discusses the need for replication of studies, arguing that we should get more funding for making scientific results more solid and robust.
I came across this article from Empirical Software Engineering and it caught my attention. It describes a study of how to identify where a bug was introduced.
The article accurately observes that defects are, most often, fixed in a place where they were NOT introduced. So, the question is whether we can find where the defects were introduced.
Several studies have focused on understanding which release/commit introduced a specific defect. This article describes how to find that particular release. It is based on a theoretical framework of perfect tests, i.e. tests which can capture defects in the releases where they were introduced. The authors evaluate four different algorithms on two different open source projects. Their findings show that it is possible, to some extent, to find the release where the bug was introduced. Knowing the release, and knowing which changes went into it, makes it possible to narrow down the piece of code that contains the bug.
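To give a feel for the perfect-test idea (my own sketch, not one of the paper’s four algorithms): if such a test existed, finding the introducing release would reduce to bisecting the release history, much like git bisect. Here, `releases` and `perfect_test` are hypothetical stand-ins:

```python
# My own sketch of the "perfect test" idea, not one of the paper's four
# algorithms. `releases` (ordered oldest to newest) and `perfect_test`
# are hypothetical stand-ins. We assume the bug, once introduced, is
# present in every later release, and that the newest release fails.

def find_introducing_release(releases, perfect_test):
    """Binary search for the first release in which perfect_test fails."""
    lo, hi = 0, len(releases) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if perfect_test(releases[mid]):   # test passes: bug not yet introduced
            lo = mid + 1
        else:                             # test fails: bug already present
            hi = mid
    return releases[lo]
```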
Very interesting work – I’m looking forward to more studies in this area, in particular on proprietary software!
A lot of defect research focuses either on the localization of defects or on predicting whether a defect will be found/fixed, etc. I’m guilty of adding to the state of the art in this area with a number of articles. It’s a great line of work – nice because we can play with data and get results that can actually be verified: we can check whether a defect is or is not there.
However, in many cases the defect can be a mistake made in a number of places – so-called multiple faults. Therefore, this article, fresh off the press at IST, caught my attention. It presents a systematic review of what has been done in that area.
Turns out, not that much, but the field has been gaining popularity in the past few years.
What I like in particular about this paper is that it asks which datasets exist (see Table 8 in the paper for the full reference). I can’t wait to take a closer look at these datasets – maybe something for my PhD course in metrics next year?
Testing machine learning systems is a tricky business. Not only are the algorithms based on statistics, they are also very complex and highly dependent on the data used for training and validation. Yet these algorithms are very important for our modern software systems, and therefore we need to make sure that they work as intended.
I came across an article in which the authors reviewed the literature on how machine learning systems are tested. The aspects this paper looks into include:
How to test:
- Test input generation
- Test oracle generation
- Test adequacy evaluation
- Bug report analysis
- Debug and repair
Where to test:
- Data testing
- Learning program testing
- Framework testing
Test for what:
- Properties such as correctness, robustness and fairness
The list is quite impressive and so is the paper. For me, the most interesting category was the testing of data, which reviews challenges and also provides some solutions. For example, it lists frameworks used for testing data, such as ActiveClean and BoostClean. These frameworks look at the data and try to capture how valuable it is for the actual algorithm.
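The core idea – data is worth what it contributes to the model – can be sketched without the actual frameworks. Below is a toy leave-slice-out check (my own illustration, not the ActiveClean or BoostClean API): retrain without a suspect slice of the training data and see how much validation accuracy changes:

```python
# Toy leave-slice-out check of the underlying idea – this is NOT the
# ActiveClean or BoostClean API. All inputs are assumed to be NumPy
# arrays; slice_mask marks the suspect training instances.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def slice_value(X_train, y_train, X_val, y_val, slice_mask):
    """Validation-accuracy change attributable to a slice of training data."""
    base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    base_acc = accuracy_score(y_val, base.predict(X_val))

    keep = ~slice_mask
    reduced = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])
    reduced_acc = accuracy_score(y_val, reduced.predict(X_val))

    # Positive: the slice helps the model; negative: it hurts (dirty data?)
    return base_acc - reduced_acc
```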