Stronger features vs. stronger algorithms in ML

I’ve been working with machine learning a bit during the last couple of years. I’ve had great teachers who showed me how to use the algorithms and where to start learning. Thanks to them I understood the importance of different elements of the ML tool chain – data, storage, algorithms, hardware.

I’ve worked on the problem of how to extract features from source code so that I can use them to predict whether a specific line of code has a defect, in particular whether the defect can be caught during code reviews. I spent about a year on this problem and tested all kinds of combinations, from static code analysis to word embeddings, dictionaries and other NLP mechanisms for understanding the code. Nothing really worked well. I got predictions that were only a bit better than chance.

What was the problem? Well, the problem was the quality of the input data. Since I extracted the data, and the features from this data, automatically from large code bases (often over 3 MLOC, i.e. 3 million lines of code), I kept running into the following problems:

Labeling – I could not pinpoint exactly where the problem was, which meant that I needed to approximate the label, which led to the next problem,

Consistency – when one line was considered good by one person, it could be considered problematic by another one; this meant that I needed to decide how to treat lines that are “suspicious”, and

Scales – when extracting features, some of them were on a scale of 1 to 100, whereas others were on a scale of 1 to 3; this meant that I needed a good scaler to get the features right.
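To illustrate the scaling problem from the last item, here is a minimal sketch of z-score standardization in plain Python. The feature names and values are made up for illustration, and this post does not say which scaler was actually used, so treat it as one possible option rather than the method behind the results.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a feature column to zero mean and unit variance (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Hypothetical features: one on a 1-100 scale, one on a 1-3 scale.
complexity = [12, 85, 40, 97, 3]
nesting_depth = [1, 3, 2, 3, 1]

scaled_complexity = standardize(complexity)
scaled_nesting = standardize(nesting_depth)
```

After this step both columns are expressed in standard deviations from their own mean, so neither dominates just because its raw numbers are larger.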

So, here I am, working on the next implementation of the feature discovery algorithm: an algorithm that extracts features in such a way that each object has distinct characteristics, yet the number of features needed to characterize each object is as small as possible. The algorithm helped me boost the accuracy of the classification from ca. 50% to over 96%.

I’ve discovered that using simple ML algorithms on a good data set trumps everything else. I used AdaBoost with feature scaling on the good data set, and that was at least twice as good as using LSTM models with word embeddings (which were not bad anyway) for the same purpose.
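For readers who have not seen AdaBoost, the idea can be sketched in plain Python with one-feature threshold rules (decision stumps) as weak learners. This is a toy illustration under stated assumptions, not the setup behind the numbers above: the data set, the number of rounds and the -1/+1 label coding are invented for the example, and feature scaling is left out to keep the sketch short.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign

def adaboost_fit(X, y, rounds=10):
    """Train a weighted vote of stumps; labels must be -1 or +1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Raise the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Toy data: the positive class sits in the middle of feature 0,
# so no single threshold rule can separate it.
X = [[1, 2], [2, 1], [3, 3], [4, 1], [5, 2], [6, 3], [7, 2], [8, 1], [9, 3]]
y = [-1, -1, -1, 1, 1, 1, -1, -1, -1]

model = adaboost_fit(X, y, rounds=40)
predictions = [adaboost_predict(model, row) for row in X]
```

No single stump fits this data, but the weighted vote of many stumps does, which is the whole point of boosting: a simple learner, applied repeatedly to the samples it still gets wrong.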

My advice, therefore, is the following:

Start with a simple classification/ML algorithm and do not go into neural networks or other advanced methods,

Learn your data and look at it from several angles; use business intelligence and statistics (e.g. PCA, t-SNE) to understand the dependencies between features, and chew on the data as long as you can, and

Focus on extracting features from your data, rather than expecting magic from ML; no algorithm can trump good input data and no filtering can trump a good “featurizer”.
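One simple way to start looking at dependencies between features is a pairwise correlation table. The sketch below uses Pearson correlation rather than PCA or t-SNE, simply because it fits in a few lines of plain Python; the feature names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        # Convention chosen for this sketch: a constant column correlates with nothing.
        return 0.0
    return cov / math.sqrt(vx * vy)

def correlation_table(features):
    """Map every ordered pair of feature names to its correlation."""
    names = list(features)
    return {(a, b): pearson(features[a], features[b]) for a in names for b in names}

# Hypothetical per-line features extracted from a code base.
features = {
    "lines_changed": [1, 2, 3, 4, 5],
    "tokens": [2, 4, 6, 8, 10],
    "reviewers": [3, 1, 2, 1, 3],
}

table = correlation_table(features)
for (a, b), r in table.items():
    if a < b:
        print(f"{a} vs {b}: {r:+.2f}")
```

A pair with a correlation close to +1 or -1 is a hint that one of the two features adds little, which is useful to know before reaching for a more complex model.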


Author: Miroslaw Staron

I’m a professor in Software Engineering at the IT faculty. I usually blog about articles I find interesting and about my own reflections on the development of Software Engineering, AI, computer science and automotive software.
