I’ve been working with machine learning a bit during the last couple of years. I’ve had great teachers who showed me how to use the algorithms and where to start learning. Thanks to them I understood the importance of different elements of the ML tool chain – data, storage, algorithms, hardware.
I’ve worked on the problem of how to extract features from source code so that I can use them to predict whether a specific line of code contains a defect, in particular a defect that could be caught during code review. I’ve spent about a year on this problem and tested all kinds of combinations, from static code analysis to word embeddings, dictionaries and other NLP mechanisms for understanding the code. Nothing really worked well. I got predictions that were only a bit better than chance.
What was the problem? Well, the problem was the quality of the input data. Since I extracted the data, and the features from it, automatically from large code bases (often over 3 MLOC), I often encountered the following problems:
Labeling – I could not pinpoint exactly where the problem was, which meant that I needed to approximate the label, which led to the next problem,
Consistency – when one line was considered good by one person, it could be considered problematic by another; this meant that I needed to decide how to treat lines that were “suspicious”, and
Scales – when extracting features, some of them were on a scale of 1 to 100, whereas others were on a scale of 1 to 3; this meant that I needed a good scaler to get the features right (see the sketch below).
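To make that last point concrete, here is a minimal sketch using scikit-learn’s MinMaxScaler; the two columns and their ranges are invented for illustration, not taken from my actual feature set:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: the first column ranges roughly 1-100,
# the second only 1-3 (e.g. an ordinal rating).
X = np.array([
    [12.0, 1.0],
    [87.0, 3.0],
    [45.0, 2.0],
    [99.0, 1.0],
])

# Rescale every feature to the [0, 1] range so that no feature
# dominates the classifier just because of its units.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```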
So, here I am, working on the next implementation of the feature discovery algorithm: an algorithm that extracts features in such a way that each object has distinct characteristics, while the number of features needed to characterize each object stays as small as possible. The algorithm helped me boost the accuracy of the classification from ca. 50% to over 96%.
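I’m not describing the algorithm itself here, so the sketch below is only a stand-in for the general idea (keep the smallest set of features that still separates the objects), using off-the-shelf mutual-information feature selection on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the real feature matrix and defect labels.
X, y = make_classification(n_samples=500, n_features=40,
                           n_informative=8, random_state=42)

# Keep only the k features that carry the most information about the
# label -- a crude proxy for "few features, each of them distinctive".
selector = SelectKBest(score_func=mutual_info_classif, k=8)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (500, 8)
```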
I’ve discovered that using simple ML algorithms on a good data set trumps everything else. I used AdaBoost with feature scaling on the good data set, and that was at least twice as good as using LSTM models with word embeddings (which were not bad anyway) for the same purpose.
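A minimal pipeline in that spirit, assuming scikit-learn and a synthetic data set in place of the real code-review features, could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the hand-crafted feature matrix.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)

# Scale the features, then boost shallow decision trees on top of them.
model = make_pipeline(MinMaxScaler(),
                      AdaBoostClassifier(n_estimators=200, random_state=0))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```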
My advice, therefore, is the following:
Start with a simple classification/ML algorithm and do not go into neural networks or other advanced methods,
Learn your data and look at it from several angles; use business intelligence and statistics to understand the dependencies between features (PCA, t-SNE; see the sketch after this list) and chew on the data as long as you can, and
Focus on extracting features from your data, rather than expecting magic from ML; no algorithm can trump good input data and no filtering can trump a good “featurizer”
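For the second point, here is a small sketch of what looking at the data from several angles can mean in practice, again on synthetic data and assuming scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the feature matrix you want to understand.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=6, random_state=1)

# PCA: how much of the variance do the first few components explain?
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# t-SNE: a 2-D embedding you can scatter-plot and colour by label.
X_2d = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(X)
print(X_2d.shape)  # (300, 2)
```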