Test case selection and prioritization using machine learning: a systematic literature review (springer.com)
Testing is an important activity in every software engineering project. In professional organizations, the process is structured and well-organized. In smaller projects, start-up style organizations, or in research studies, the process is less organized.
There are different views on why we test. Some think that we test to find defects, some to prove that the software works correctly, and some think that we do it to waste time (well, maybe not so many). In my experience it is a combination of the first and the second. We test to find defects and also to track how good our software gets over time (software reliability growth modelling).
This paper presents a systematic literature review on using machine learning (ML) to select and prioritize test cases. I think that the authors summarize their contribution very well (quote; TS stands for test selection, TP for test prioritization, and TSP for both):
- The main ML techniques used for TSP are: supervised learning (ranking models), unsupervised learning (clustering), reinforcement learning, and natural language processing.
- ML-based TSP techniques mainly rely on features that are easy to compute and based on data that are practical to collect in a CI context, including execution history, coverage information, code complexity, and textual data.
- ML-based TSP techniques are evaluated using a variety of metrics that are, sometimes, calculated differently in TS and TP, making it difficult to compare their results. Most of the currently available subjects have extremely low failure rates, making them unsuitable for evaluating ML-based TSP techniques.
- Comparing the performance of ML-based TSP techniques is challenging due to the variation of evaluation metrics, test suite sizes, and failure rates across studies. Reporting failure rates alongside performance values helps provide more interpretable results to the wider research community.
- Only six out of the 29 selected studies (21%) can be considered reproducible, thus raising methodological issues in the studies and a lack of confidence in reported results.
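To make the points about execution history, evaluation metrics, and failure rates a bit more concrete, here is a minimal sketch in Python. It is not taken from the paper: it ranks tests by their historical failure rate (a plain heuristic, not an ML model; a learned ranking model would replace the scoring function) and scores the resulting order with APFD, one commonly used prioritization metric. The names `HistoryRecord`, `prioritize`, and `apfd` are hypothetical, and the sketch assumes that each failing test reveals exactly one distinct fault.

```python
from dataclasses import dataclass


@dataclass
class HistoryRecord:
    """Hypothetical container: one test and its past CI verdicts."""
    name: str
    history: list[bool]  # True = the test failed in that past run, oldest first


def failure_rate(record: HistoryRecord) -> float:
    """Share of past runs in which the test failed (0.0 if it has no history)."""
    if not record.history:
        return 0.0
    return sum(record.history) / len(record.history)


def prioritize(records: list[HistoryRecord]) -> list[str]:
    """Order test names by descending historical failure rate.

    sorted() is stable, so tests with equal rates keep their input order.
    """
    ranked = sorted(records, key=failure_rate, reverse=True)
    return [r.name for r in ranked]


def apfd(order: list[str], failing: set[str]) -> float:
    """APFD = 1 - sum(TF_i) / (n * m) + 1 / (2 * n).

    Assumption: every failing test reveals one distinct fault, so TF_i is
    simply the 1-based position of the i-th failing test in the order.
    """
    n, m = len(order), len(failing)
    if n == 0 or m == 0:
        raise ValueError("APFD is undefined without tests or without failures")
    positions = [order.index(name) + 1 for name in failing]
    return 1 - sum(positions) / (n * m) + 1 / (2 * n)


# Illustrative data only, not results from any study.
records = [
    HistoryRecord("a", [False, False, True]),
    HistoryRecord("b", [True, True, False]),
    HistoryRecord("c", []),
    HistoryRecord("d", [False, False, False]),
]
order = prioritize(records)
score = apfd(order, failing={"a"})
```

Note that `apfd` raises an error when no test fails. That is the practical side of the low-failure-rate problem the authors mention: on a subject where almost every run is green, such a metric is undefined or computed from very few data points, which is why reporting the failure rate next to the score helps.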
The biggest surprise for me is that complexity-based metrics are still widely used in this context. I’m happy that new approaches are on the rise, for example textual analyses. I guess there is a point in combining approaches, but complexity seems like a very coarse-grained instrument for this type of analysis. We know that it correlates well with size, and the larger the test (or the unit under test, UUT), the higher the probability of triggering a failure.
Well, I guess I need to run more experiments myself to check whether I am missing something.