2020 was the year like no other. Everyone can agree with that. The pandemic changed our lives a lot – the pace of digitalization has gone from tortoise to a Space-X rocket!
For me, this year has also changed a lot of things. I’ve moved into new field of medical signal analysis using ML. I realized that my skillset can be used to help people. Maybe not the ones that were hit by the pandemic, but still people who need our help.
Together with a team of great specialists from the Sahlgrenska university hospital, we managed to create a set-up of collecting data in the operation room, tagging them and then, finally using ML.
In the last three months, we managed to move from 0 to having three articles in the making, collecting data from several patients, fantastic accuracy and a great deal of fun.
I’ve reflected upon this project and it’s probably the project where I had the most fun during 2020. It’s a completely new set-up, great team, extreme energy in the work and a great deal of meaning behind it.
The project was partially sponsored by Chalmers CHAIR initiative. Thank you!
Machine learning is hungry for data. The more you have, the happier it will be. Seems very easy when we learn how to program ML and how it works – there is plenty of open data sources to practice and learn from.
However, when we want to use ML for our purposes, things get a bit more complicated. There is a lot of data, but not in the right format. The one that is in the right format is incomplete. The one that is complete, is noisy. The one that is not noisy is too little. We need to collect more. And so the story goes on, and on, and on….
Collecting the data is not that problematic, as it can often be automated. At least in software engineering, automotive, telecon, transport/logistic and medicine. These are the ones I know, anyways. What is problematic, though is data labelling. It is the activity where we take each data point and add a class to it, or its label if we speak machine-learnish. The person doing the labelling needs to be competent to be able to label the data correctly – he/she needs to know the domain, know the data, know the context. Then, this person also needs to have a fantastic memory, because the labels need to be consistent. They also need to be unambiguous given the underlying feature vector.
In this paper, colleagues from our department study the process of data labelling and its challenges.
They find the following to be selected examples of challenges:
Lack of a systematic approach to labeling data for specific features
Unclear responsibility for labeling
Difficulty to find a correlation between labels and features
Skewed label distributions
Difficulty to predict future uses for datasets
I think it’s a great work and reading for everyone who wants to get into ML for real, start using it at a company and understand whether it’s actually gives any benefit.
From the abstract: Labeling is a cornerstone of supervised machine learning. However, in industrial applications, data is often not labeled, which complicates using this data for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning, and semi-supervised learning, these still do not provide accurate and reliable labels for every machine learning use case in the industry. In this context, the industry still relies heavily on manually annotating and labeling their data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using a semi-structured interview with data scientists at two companies to explore their problems when labeling and annotating their data. This paper provides two contributions. We identify industry challenges in the labeling process, and then we propose mitigation strategies for these challenges.