Data veracity is the degree to which data corresponds to the true values. It comes from the metrological concept of “measurement trueness”, which is the degree to which a measurement quantifies the value correctly.
Well, that sounds very simple, but it is in fact quite complex. In our previous work, we scrutinized what it means to have veracious data in transport systems (https://ieeexplore.ieee.org/abstract/document/7535482). It turns out that “lying” is not the only option here.
In this book, the author looks into the ways in which things can be untrue. Sometimes deliberately, by lying; sometimes by mistake. Sometimes, as we learn in the last chapter (the one with the Brazilian aardvark), a mistake can actually end up being accepted as truth over time.
I recommend the book as it is written in a fantastic manner, providing examples from the real world (e.g. the alleged drone sightings over Gatwick in 2018). It even goes a bit further and discusses the need for replicating studies and for more funding to make scientific results more solid and robust.
I’ve come across this article from Empirical Software Engineering and it caught my attention. It describes a study of how to identify where a bug was introduced.
The article accurately observes that defects are, most often, fixed in a place where they were NOT introduced. So, the question is whether we can find where the defects were introduced.
Several studies have focused on understanding which release/commit introduced a specific defect. This article describes how to find that particular release. It is based on a theoretical framework of perfect tests, i.e. tests which can capture defects in the releases where they were introduced. The authors evaluate four different algorithms on two different open source projects. Their findings show that it is possible, to some extent, to find the right release where the bug was introduced. Knowing the release, and knowing which changes went into it, makes it possible to narrow down the piece of code that contains the bug.
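As a side note, the idea of a perfect test lends itself to a simple illustration. The sketch below is my own simplification, not one of the four algorithms from the paper: assuming a hypothetical perfect_test(release) that passes for all releases before the defect was introduced and fails from the introducing release onwards, a binary search over the ordered releases pinpoints the culprit.

```python
# A minimal sketch (my own simplification, not the paper's algorithms).
# Assumption: perfect_test(release) passes for releases before the defect
# was introduced and fails from the introducing release onwards.

def find_introducing_release(releases, perfect_test):
    """Binary search for the first release in which the perfect test fails."""
    lo, hi = 0, len(releases) - 1
    candidate = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if perfect_test(releases[mid]):   # test passes: defect not yet introduced
            lo = mid + 1
        else:                             # test fails: defect introduced here or earlier
            candidate = releases[mid]
            hi = mid - 1
    return candidate

# Hypothetical usage: releases ordered oldest to newest.
releases = ["1.0", "1.1", "1.2", "2.0", "2.1"]
buggy_since = {"2.0", "2.1"}
print(find_introducing_release(releases, lambda r: r not in buggy_since))  # -> 2.0
```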
Very interesting work, and I’m looking forward to more studies in this area, in particular on proprietary software!
A lot of defect research focuses on either the localization of defects or the prediction of whether a defect will be found/fixed, etc. I’m guilty of adding to the state of the art in this area with a number of articles. It’s a great line of work, nice because we can play with data and get results that can actually be verified – we can check whether a defect is or is not there.
However, in many cases, a defect can be a mistake made in a number of places – so-called multiple faults. Therefore, this article, fresh off the press at IST, caught my attention. It presents a systematic review of what has been done in that area.
Turns out, not that much, but the field has been gaining popularity in the past few years.
What I like, in particular, about this paper is the fact that it asks which datasets exist (see Table 8 in the paper for the full reference). I can’t wait to take a closer look at these datasets – maybe something for my PhD course in metrics next year?
Testing of machine learning systems is a tricky business. Not only are the algorithms based on statistics, they are also very complex and highly dependent on the data used for training and validation. Yet these algorithms are very important for our modern software systems, and therefore we need to make sure that they work as intended.
I’ve come across an article where the authors reviewed the literature on how machine learning systems are tested. The aspects that the paper looks into are:
What to test:
Test input generation
Test oracle generation
Test adequacy evaluation
Bug report analysis
Debug and repair
Where to test:
Data testing
Learning program testing
Framework testing
Test for what:
Correctness
Model Relevance
Robustness & Security
Efficiency
Fairness
Interpretability
Privacy
The list is quite impressive, and so is the paper. For me, the most interesting category was the testing of data, which reviews challenges and also provides some solutions. For example, it lists frameworks used for testing data, such as ActiveClean and BoostClean. These frameworks look at the data and try to capture how valuable it is for the actual algorithm.
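To make the idea of data testing a bit more concrete, here is a minimal, hand-rolled sketch of the kind of checks such tools automate. It is not the ActiveClean or BoostClean API, and the column names are hypothetical; it simply shows data tests written as plain assertions over a pandas data frame.

```python
# Hand-rolled illustration of data testing, not the ActiveClean/BoostClean API.
# The column names ("loc", "complexity", "label") are hypothetical.
import pandas as pd

def test_training_data(df: pd.DataFrame) -> None:
    # Schema: the features and the label we expect must be present.
    assert {"loc", "complexity", "label"}.issubset(df.columns), "missing columns"
    # Completeness: no missing values in the training set.
    assert not df[["loc", "complexity", "label"]].isna().any().any(), "missing values"
    # Validity: features must lie within plausible ranges.
    assert (df["loc"] >= 0).all(), "negative lines of code"
    # Label sanity: both classes present and not pathologically imbalanced.
    counts = df["label"].value_counts(normalize=True)
    assert len(counts) == 2 and counts.min() > 0.01, "degenerate label distribution"

test_training_data(pd.DataFrame({
    "loc": [10, 250, 42], "complexity": [1, 7, 3], "label": [0, 1, 0]
}))
```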
I’ve been working with machine learning a bit during the last couple of years. I’ve had great teachers who showed me how to use the algorithms and where to start learning. Thanks to them I understood the importance of different elements of the ML tool chain – data, storage, algorithms, hardware.
I’ve worked on the problem of how to extract features from source code so that I can use them to predict whether a specific line of code has a defect or not, in particular whether the defect can be caught during code reviews. I’ve spent about a year on this problem and tested all kinds of combinations, from static code analysis to using word embeddings, dictionaries and other NLP mechanisms to understand the code. Nothing really worked great. I got predictions that were only a bit better than chance.
What was the problem? Well, the problem was the quality of the input data. Since I extracted data, and features from this data, automatically from large code bases (often over 3 MLOC), I often encountered the following problems:
Labeling – I could not pinpoint exactly where the problem was, which meant that I needed to approximate the label, which led to the next problem,
Consistency – when one line was considered good by one person, it could be considered problematic by another one; this meant that I needed to decide how to treat lines that are “suspicious”, and
Scales – when extracting features, some of them were on a scale of 1 to 100, whereas others were on a scale from 1 to 3; this meant that I needed a good scaler to get the features right.
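For the scales problem specifically, a standard scaler is usually enough. Here is a minimal sketch with made-up feature values, just to show the idea of bringing a 1–100 feature and a 1–3 feature onto the same footing:

```python
# Minimal sketch with made-up values: one feature on a 1-100 scale,
# one on a 1-3 scale, rescaled so that both end up in [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[87, 1],
              [12, 3],
              [55, 2]], dtype=float)   # columns: two features on very different scales

X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```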
So, here I am, working on the next implementation of the feature discovery algorithm: an algorithm that can extract features in such a way that each object has distinct characteristics, yet the number of features is as small as possible while still characterizing each object. The algorithm helped me to boost the accuracy of the classification from ca. 50% to over 96%.
I’ve discovered that using simple ML algorithms on a good data set trumps everything else. I used AdaBoost with scaled features on the good data set, and that was at least twice as good as using LSTM models with word embeddings (which were not bad anyway) for the same purpose.
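For the record, a minimal sketch of that kind of setup, with a synthetic data set standing in for my real features (so the numbers it prints illustrate the pipeline, not my actual results):

```python
# Sketch of the "simple algorithm + scaled features" setup on synthetic data;
# it illustrates the pipeline, not my actual data set or results.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=42))
scores = cross_val_score(model, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```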
My advice, therefore, is the following:
Start with a simple classification/ML algorithm and do not go into neural networks or other advanced methods,
Learn your data and look at it from several angles; use business intelligence and statistics to understand the dependencies between features (PCA, t-SNE – see the sketch after this list) and chew on the data as long as you can, and
Focus on extracting features from your data, rather than expecting magic from ML; no algorithm can trump good input data and no filtering can trump a good “featurizer”
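A minimal sketch of the second point, again on synthetic data: project the features with PCA and t-SNE and plot the projections to see how the classes separate and how much of the variance a few components capture.

```python
# Sketch of exploring feature dependencies with PCA and t-SNE (synthetic data).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("variance explained by two components:", pca.explained_variance_ratio_.sum())

X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=10)
axes[0].set_title("PCA")
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, s=10)
axes[1].set_title("t-SNE")
plt.show()
```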
I picked up this book to learn a bit about perseverance and the power of pursuing goals. I hoped to see if there was something I could learn from it for my new year’s resolutions.
It turned out to be a great book about being humble and about getting rejected. Let me explain. The concept of grit means that one has the guts to do something, the resilience to handle rejection, the initiative to start working on the next steps regardless of the outcome, and, finally, the tenacity – the ability to stay focused on the goals.
The last one is an important one for the new year’s resolutions, but the resilience is an interesting quality. One can go on autopilot for the mundane things, but still needs the resilience when things go wrong. Sounds a bit like academic careers. We plan studies, conduct them, try to publish, get rejected, improve the papers, try to publish, etc.
We also need to have initiative to move the field of our study forward. We need to come up with new project ideas, submit research proposals. Get rejected. Fine tune the proposals, resubmit somewhere else, and so on.
Finally, guts is a big quality. Researchers need to have the guts to take on big problems, to plan and conduct studies, to speak in front of large audiences. Yes, speaking is not something that comes easily to most of us. We still need to prepare and find what we want to say and how. We need to adjust our talks based on the audience, the message and the goal of the talk.
It’s a great book to get some motivation for the work after the vacations. Work hard, publish, apply for funding and work even harder. Amidst all of that, please remember that you need to have students with you and that they need your attention too!
In my current PhD class I teach my younger colleagues how to work with measurements. It may sound straightforward, but one of the challenges is the data collection. I’ve written about this in my previous posts.
So, here I propose to look at the video recording from the lecture about it.
In the recording, we discuss different types of measurement instruments and how they are related to the measurement concepts like measured entity, measured attribute, base measure and measurement method.
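To make the vocabulary concrete, here is a small sketch of my own (with a hypothetical example) of how the concepts relate: a measurement instrument applies a measurement method to a measured attribute of a measured entity and produces a base measure.

```python
# My own illustration of the measurement vocabulary from the lecture;
# the example (a source file measured for size in LOC) is hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BaseMeasure:
    measured_entity: str      # the thing we measure, e.g. a source file
    measured_attribute: str   # the property of that entity, e.g. its size
    value: float
    unit: str

def measurement_instrument(entity: str,
                           attribute: str,
                           method: Callable[[str], float],
                           unit: str) -> BaseMeasure:
    """Applies a measurement method to an attribute of a measured entity."""
    return BaseMeasure(entity, attribute, method(entity), unit)

# Hypothetical "code base": entity name -> file content.
files = {"main.py": "print('hello')\nprint('world')\n"}
count_loc = lambda entity: float(len(files[entity].splitlines()))

measure = measurement_instrument("main.py", "size", count_loc, "LOC")
print(measure)  # value=2.0, unit='LOC'
```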
Once in a while I pick up a book about something outside of my expertise. As a kid I lived ca. 200 km from Chernobyl, where the biggest nuclear disaster happened in April 1986 (I actually remember that day and the few days after). The book caught my interest because of its subtitle – the untold story of the nuclear disaster. I, admittedly, wanted to know what the disaster looked like from the operators’ side.
No one really knows what the long-term effects of the disaster really are (after all, 30+ years is not such a long term), but it’s interesting to see how the disaster happened and what we can learn from it in software engineering.
So, in short, the disaster happened because of a combination of factors.
First, the design of the reactor was flawed. The mix of substances used in the reactor has certain properties that raise the effect when they should lower it, or raise it when not monitored constantly.
Second, the implementation of the design, the construction of the power plant, was not great either. Materials of lower specs were used due to shortages in the USSR. The workers did not care much about state property, and the 5-year plans trumped safety, security measures and even common sense.
Third, and no less important, the operations did not follow the instructions. The operators deviated from the instructions for the test they were about to run: they reduced the power below the limit and then executed the test anyway. Instead, they should have stopped the reactor and run the test during the next available window.
So, what does it have to do with software engineering? There was no software malfunction, but a set of human errors.
IMHO, this accident teaches us about the importance of safety mechanisms in software. I believe that many of us who design software do not think much about the potential implications of what we do. We get a set of requirements, which we implement. However, what we should do is look more broadly at how users can use our system and how we can prevent any potential disaster.
For example, when we implement a game app: should we allow people to play the game as much as they want? Should we serve them all kinds of commercials? Or should we help them by saying that they have played long enough and could consider a break? Or maybe we should filter the commercials if we know that the game is played by a child?
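Just to make the example tangible, here is a tiny sketch of what such safeguards could look like in the game’s session logic. Everything in it (names, the two-hour threshold, the age limit) is hypothetical.

```python
# Entirely hypothetical sketch of the safeguards discussed above.
from dataclasses import dataclass

@dataclass
class Session:
    player_age: int
    minutes_played: int

MAX_MINUTES_BEFORE_NUDGE = 120   # hypothetical threshold: two hours

def should_suggest_break(session: Session) -> bool:
    """Nudge the player to take a break after a long session."""
    return session.minutes_played >= MAX_MINUTES_BEFORE_NUDGE

def may_show_ad(session: Session, ad_rated_for_adults: bool) -> bool:
    """Filter adult-rated commercials when the player is a child."""
    return session.player_age >= 18 or not ad_rated_for_adults

session = Session(player_age=10, minutes_played=130)
print(should_suggest_break(session))                   # True
print(may_show_ad(session, ad_rated_for_adults=True))  # False
```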
I think that this is something we need to consider a bit more. We should even discuss it when we design our curricula and decide how we implement them.
When discussing data-driven development and the use of data to identify new features and products, it is always the needs of the organization that come first. Companies design the system, define the organization’s needs and then design the experiments which will provide the organization with the data needed to validate its hypotheses.
In our earlier work on measurement programs, what the theories prescribed was that organizations should only look at their goals and needs. What we discovered was that it was a combination: what the company needs and what it can actually measure. The reality of the organizations that we studied was that not all needs could be fulfilled by the data they had or by the data they could possibly have.
The article did not have anything to do with measurement programs, but it had a lot to do with data. Its content was about global apps, but what caught my attention was the concept of providing the user with feedback about what he/she can do with the data, rather than about what data is needed for the task.
Sounds a bit crazy, but I think it’s an important step towards real data-driven development. Imagine that instead of discussing what we should do and how to do it, we can take a look at the data and immediately know what we can do.
If we know directly what we can do with the data, then we can just do it (or not), rather than spending time discussing whether we can or cannot do it.
What it also means is that we can think more about the product than about the data. We can think about which features can be developed or dropped from the product. We do not even need to design experiments; we can just observe the products in the field.
Software engineering is an applied scientific area. It includes working with industrial applications and solving challenges that modern organizations face today.
Thanks to many of my colleagues, I’ve had the opportunity to work with industry-embedded research since I arrived here in Gothenburg. I want to share these experiences with colleagues and students, which led me to writing a book about action research.
Abstract:
This book addresses action research (AR), one of the main research methodologies used for academia-industry research collaborations. It elaborates on how to find the right research activities and how to distinguish them from non-significant ones. Further, it details how to glean lessons from the research results, no matter whether they are positive or negative. Lastly, it shows how companies can evolve and build talents while expanding their product portfolio.
The book’s structure is based on that of AR projects; it sequentially covers and discusses each phase of the project. Each chapter shares new insights into AR and provides the reader with a better understanding of how to apply it. In addition, each chapter includes a number of practical use cases or examples. Taken together, the chapters cover the entire software lifecycle: from problem diagnosis to project (or action) planning and execution, to documenting and disseminating results, including validity assessments for AR studies.
The goal of this book is to help everyone interested in industry-academia collaborations to conduct joint research. It is for students of software engineering who need to learn about how to set up an evaluation, how to run a project, and how to document the results. It is for all academics who aren’t afraid to step out of their comfort zone and enter industry. It is for industrial researchers who know that they want to do more than just develop software blindly. And finally, it is for stakeholders who want to learn how to manage industrial research projects and how to set up guidelines for their own role and expectations.