Machine learning is hungry for data. The more you have, the happier it will be. Seems very easy when we learn how to program ML and how it works – there is plenty of open data sources to practice and learn from.
However, when we want to use ML for our purposes, things get a bit more complicated. There is a lot of data, but not in the right format. The one that is in the right format is incomplete. The one that is complete, is noisy. The one that is not noisy is too little. We need to collect more. And so the story goes on, and on, and on….
Collecting the data is not that problematic, as it can often be automated. At least in software engineering, automotive, telecon, transport/logistic and medicine. These are the ones I know, anyways. What is problematic, though is data labelling. It is the activity where we take each data point and add a class to it, or its label if we speak machine-learnish. The person doing the labelling needs to be competent to be able to label the data correctly – he/she needs to know the domain, know the data, know the context. Then, this person also needs to have a fantastic memory, because the labels need to be consistent. They also need to be unambiguous given the underlying feature vector.
In this paper, colleagues from our department study the process of data labelling and its challenges.
They find the following to be selected examples of challenges:
- Lack of a systematic approach to labeling data for specific features
- Unclear responsibility for labeling
- Noisy labels
- Difficulty to find a correlation between labels and features
- Skewed label distributions
- Time dependence
- Difficulty to predict future uses for datasets
I think it’s a great work and reading for everyone who wants to get into ML for real, start using it at a company and understand whether it’s actually gives any benefit.
From the abstract: Labeling is a cornerstone of supervised machine learning. However, in industrial applications, data is often not labeled, which complicates using this data for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning, and semi-supervised learning, these still do not provide accurate and reliable labels for every machine learning use case in the industry. In this context, the industry still relies heavily on manually annotating and labeling their data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using a semi-structured interview with data scientists at two companies to explore their problems when labeling and annotating their data. This paper provides two contributions. We identify industry challenges in the labeling process, and then we propose mitigation strategies for these challenges.