In the last post I’ve discussed the need to create good features from the data, and that this trumps the choice of the algorithm. Today, I’ve chosen to share my observations on the need of good data storage for machine learning. It’s a no brainer and everyone data scientist knows that this is important.
However, what the data scientists and machine learning specialists struggle with is which data to store and how?
Imagine a case when you develop a system that takes data from a Jenkins build system. It’s easy to collect the raw data from Jenkins using a REST API. You know how to do it, so you do not store the raw data – you just extract the features and dump the raw data. A week after you try to collect it again and the data is not there or is different, incomplete, manipulated. You just wanted to add one more feature to the data set, but you cannot, because the raw data is now available.
In our work with metrics and machine learning we realized that we need to store all data, raw data, featurized data, metrics and even decisions made on this data. Why, because of the traceability. All of that is caused by the constant evolution of software engineering.
First, we need to store the raw data as our feature extraction techniques evolve and thus we need to add new features. For example, a company adds a new field in Jenkins or uses a new tag when adding comments. We can use that information, but we probably need to change it for the entire data set.
Second, we need to store all intermediate metrics and decisions as we need to know whether the evolved data or evolved algorithms actually work better than the previous ones. Precision, recall and F1-scores are too coarse grained to actually understand if the improvement/deterioration is real or in the right/wrong direction.
Finally, we need to store the decisions as we need to know what we actually improve. We often store recommendations, but very seldom store decisions. We can use Online Experiment Systems (see publications by J. Bosch) in order to keep track of the results and of the decisions.
From my experience of working with companies, I see that keeping the raw data is not a problem, although it happens that it is neglected. I see that many companies neglect to store the decisions, so when an improvement is made, there is no real evidence that the improvement is real.