Having worked with AI as a software engineer for a while, I have noticed that the engineering of AI systems is different. Maybe not the building of the actual system, but the knowledge we need about quality, testing and maintenance differs.
In this article, IEEE Software’s Editor in Chief presents her view on the topic. The main point is that engineering AI systems is both similar to and different from engineering other software. This quote from the paper summarizes it nicely: “I argue that our existing design techniques will not only help us make progress in understanding how to design, deploy, and sustain the structure and behavior of AI-enabled systems, but they are also essential starting points. I suggest that what is different in AI engineering is, in essence, the quality attributes for which we need to design and analyze, not necessarily the design and engineering techniques we rely on.”
One of the differences is the development process. It is not aligned with that of non-ML systems, e.g. in terms of training, testing and maintenance. ML systems are data-centric, and this needs to be reflected in AI engineering processes.
Ipek Ozkaya discusses the following misconceptions about the differences:
We can specify systems – both AI and non-AI systems cannot really be fully specified,
System correctness can be verified – we can never fully verify systems, neither AI-based nor non-AI-based (e.g. due to complexity),
We can avoid hidden dependencies,
We can manage system change propagation,
Frameworks do it all,
We can build reliable systems from unreliable and unpredictable subcomponents
I recommend this article to get a quick overview of the gist of the differences and misconceptions.
Engineering machine learning systems is much more than train-evaluate cycles. It means that we need to systematically integrate these ML systems with the rest of the components. We need to build safety cages to ensure that the decisions are not out of bounds, and we need to make sure that we can maintain these systems.
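To make the safety-cage idea a bit more concrete, here is a minimal sketch in Python. It is my own illustration, not taken from the paper: the `SafetyCage` class, the toy model and the bounds are all hypothetical. The point is only that a conventional, easily verified component sits between the ML model and the rest of the system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SafetyCage:
    """Wraps a model and keeps its output inside a fixed, pre-approved range."""
    model: Callable[[float], float]  # hypothetical ML component
    lower: float
    upper: float
    fallback: float  # safe value used when the model output is unusable

    def decide(self, x: float) -> float:
        y = self.model(x)
        # NaN never compares equal to itself, so this catches a broken output.
        if y != y:
            return self.fallback
        # Clamp out-of-bounds decisions to the nearest allowed value.
        return max(self.lower, min(self.upper, y))


def toy_steering_model(speed: float) -> float:
    # Stand-in for a trained model; deliberately unbounded.
    return 0.5 * speed - 10.0


cage = SafetyCage(toy_steering_model, lower=-5.0, upper=5.0, fallback=0.0)
print(cage.decide(100.0))  # the raw model output 40.0 is clamped to 5.0
```

A real safety cage would of course need domain-specific rules rather than a simple clamp, but the structure stays the same.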
In this paper, the authors studied automated (though not fully autonomous) driving vehicles and showed the challenges that we need to solve before AI and ML become one of our “fellow drivers” on the roads.
The findings of the paper show that it’s not going to happen soon. As the authors say in the abstract: “Our results show that machine learning models are characterized by a lack of requirements specification, lack of design specification, lack of interpretability, and lack of robustness. We also perform a gap analysis on a conventional system quality standard SQuaRE with the characteristics of machine learning models to study quality models for machine learning systems. We find that a lack of requirements specification and lack of robustness have the greatest impact on conventional quality models.”
The authors provide a process for machine learning models as part of safety-critical software, where the design of the system and its validation in real scenarios are further apart than in traditional development.
The paper reviews the comments of developers who comment and/or post questions about three deep learning frameworks: Theano, Tensorflow and PyTorch. I got interested in the paper because I wanted to see whether the communities using these frameworks differ. I was introduced to Tensorflow a while back and have kept using it. Since I’m not an ML researcher, the framework does not really matter to me, but I would still like to know whether I should read up on some new framework during the summer.
The observations quoted from the abstract:
1) a wide range of topics that are discussed about the three deep learning frameworks on both platforms, and the most popular workflow stages are Model Training and Preliminary Preparation.
2) the topic distributions at the workflow level and topic category level on Tensorflow and PyTorch are always similar while the topic distribution pattern on Theano is quite different. In addition, the topic trends at the workflow level and topic category level of the three deep learning frameworks are quite different.
3) the topics at the workflow level show different trends across the two platforms. e.g., the trend of the Preliminary Preparation stage topic on Stack Overflow comes to be relatively stable after 2016, while the trend of it on GitHub shows a stronger upward trend after 2016.
It’s interesting that the topics are roughly the same, but I’m a bit surprised that the topics are mostly about the data management/machine learning and not the frameworks themselves. This means that applications win over development of the frameworks – at least at the moment.
New sub-areas or fields within software engineering are not that common, but they come up once in a while. The authors of this article (https://doi-org.ezproxy.ub.gu.se/10.1007/s10664-020-09808-9, Better software analytics via “DUO”: Data mining algorithms using/used-by optimizers) argue that this is the case now.
In this article, the authors provide a view that data mining and building optimization models are done in tandem and that this is the new field. They show that the data mined from repositories influences optimization models and that the development of models influences data mining.
The authors make the following claims (quoted from the paper, references removed):
Claim 1: For software engineering tasks, optimization and data mining are very similar. Hence, it is natural and simple to combine the two methods.
Claim 2: For software engineering tasks, optimizers can greatly improve data miners. A data miner’s default tuners can lead to sub-optimal performance. Automatic optimizers can find tunings that dramatically improve that performance.
Claim 3: For software engineering tasks, data miners can greatly improve optimization. If a data miner groups together related items, an optimizer can explore and report conclusions that are general across a set of solutions. Further, optimization for SE problems can be very slow. But if that optimization executes over the groupings found by a data miner, that inference can terminate orders of magnitude faster.
Claim 4: For software engineering tasks, data mining without optimization is not recommended. Conclusions reached from an unoptimized data miner can be changed, sometimes even dramatically improved, by running the same tuned learner on the same data. Researchers in data mining should, therefore, consider adding an optimization step to their analysis.
These claims make a lot of sense and they are aligned with my observations. I recommend this article to everyone who is working in or building a metrics team or a data analysis/data science team.
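The second claim, that an optimizer can improve a data miner by tuning it, can be sketched in a few lines of Python. This is only my own toy illustration under several assumptions: the data miner is a tiny k-nearest-neighbour classifier, the "optimizer" is a plain grid search (the paper discusses far more capable optimizers), and the data set is made up.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy of the k-NN data miner for a given k."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k) == label
    return hits / len(data)


def tune_k(data, candidates):
    """Optimizer step: pick the k with the best leave-one-out accuracy."""
    return max(candidates, key=lambda k: loo_accuracy(data, k))


# Made-up 1-D data set of (feature, label) pairs with a few noisy labels.
DATA = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 0), (9.0, 1), (10.0, 1)]

DEFAULT_K = 1
best_k = tune_k(DATA, candidates=[1, 3, 5, 7])
print("default:", loo_accuracy(DATA, DEFAULT_K))
print("tuned  :", loo_accuracy(DATA, best_k), "with k =", best_k)
```

Since the default value is among the candidates, the tuned setting can never score worse than the default on this measure; how much it gains depends entirely on the data.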
What is interesting about this paper is that it presents a framework for testing ML applications. I’ve not tried it yet, but I will, as I want to see how metamorphic testing and metamorphic relations work in practice. I’m also interested in how to measure the quality of the software in this context.
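For readers who have not seen metamorphic testing before, here is a minimal sketch of the idea in Python. It does not use the framework from the paper; the classifier and the two metamorphic relations are my own assumptions, chosen because they are easy to check. Instead of asserting what the correct output is, each relation asserts how the output may (or may not) change when the input is transformed.

```python
import random


def nearest_label(train, x):
    """1-nearest-neighbour classifier on 1-D features (the system under test)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def mr_permutation(train, queries, seed=0):
    """MR1: shuffling the training set must not change any prediction."""
    shuffled = train[:]
    random.Random(seed).shuffle(shuffled)
    return all(nearest_label(train, q) == nearest_label(shuffled, q)
               for q in queries)


def mr_shift(train, queries, offset=100.0):
    """MR2: adding the same offset to every feature must not change predictions."""
    shifted = [(x + offset, label) for x, label in train]
    return all(nearest_label(train, q) == nearest_label(shifted, q + offset)
               for q in queries)


# Made-up data; the queries are chosen so that no two neighbours are equally close.
TRAIN = [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]
QUERIES = [0.0, 3.0, 7.5, 10.0]

print("permutation relation holds:", mr_permutation(TRAIN, QUERIES))
print("shift relation holds:", mr_shift(TRAIN, QUERIES))
```

If one of the relations fails, we know there is a problem without ever needing a labelled test oracle, which is exactly what makes the technique attractive for ML applications.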
This is a great paper demonstrating the use of NLP techniques for completion of software source code. It uses recurrent networks and can reduce the size of the vocabulary compared to previous approaches.
As the authors say: “The CodeGRU introduces a novel approach which can correctly capture the source code context by leveraging the token type information.”
I like the approach because it can extract the information that is important for the analysis of source code – what kind of token is analysed and how it is used.
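To show what token type information can look like, here is a small sketch using Python’s standard `tokenize` module. It is not the authors’ pipeline and it has nothing to do with the recurrent network itself; the `abstract_tokens` function and the `<NAME>`-style placeholders are my own assumptions. It only illustrates why replacing concrete identifiers and literals with their token type shrinks the vocabulary.

```python
import io
import keyword
import token
import tokenize

LAYOUT = (token.NEWLINE, token.NL, token.INDENT,
          token.DEDENT, token.ENDMARKER, token.COMMENT)


def abstract_tokens(source):
    """Replace identifiers and literals with their token type, keep the rest."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in LAYOUT:
            continue  # layout tokens and comments add nothing to the vocabulary
        if tok.type == token.NAME and not keyword.iskeyword(tok.string):
            result.append("<NAME>")
        elif tok.type in (token.NUMBER, token.STRING):
            result.append("<" + token.tok_name[tok.type] + ">")
        else:
            result.append(tok.string)  # keywords and operators stay as they are
    return result


SOURCE = "total = price * 2\ncount = price * 3\n"
print(abstract_tokens(SOURCE))
print("vocabulary:", sorted(set(abstract_tokens(SOURCE))))
```

The two source lines use different variable names and numbers, but after the replacement they map to the same token sequence.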
Conclusions (quote from the abstract): “Our experiment confirms that the source code’s contextual information can be vital and can help improve the software language models. The extensive evaluation of CodeGRU shows that it outperforms the state-of-the-art models. The results further suggest that the proposed approach can help reduce the vocabulary size and is of practical use for software developers.”
I’m keen to try this approach in our work and see whether we can use it to improve the quality of source code.
Working with software requirements and metrics is an important part of research in modern software companies. Although many of the companies are Agile or post-Agile, claiming that they do not have requirements, they still capture user needs in textual forms. For example, they describe user stories, epics and use cases.
This paper takes an interesting view on software requirements quality assessment. Instead of just calculating metrics and creating quality models, the authors use machine learning to mimic the way in which experts judge what is a good requirement and what is not. They use several quality functions to distinguish between the good and the bad requirements. Using multiple functions in a multidimensional space makes it possible to select groups of requirements that are separated from the other class; the figures in the paper show how this works in practice.
The summary of the gist of the paper is actually presented best in the introduction (quote): “Summing up, we can compute a set of quantitative metrics of textual requirements, and through them, we can assess the quality of requirements. However, the risk of this approach is to build assessment methods and tools that are both arbitrary in the parameterization of metrics and rigid in the combination of metrics to evaluate the different properties. This is why we propose in this work to develop a flexible assessment method that can be adapted to different contexts, with a high degree of automation. The method consists basically in the emulation of the experts’ judgment on quality through artificial intelligence techniques: first, obtain the expert’s implicit quality function through machine learning, and, second, apply this function to automatically assess the quality of textual requirements.
Our approach to emulate the experts’ judgment, as explained later in detail, is based on well-known machine learning techniques: we have a computer tool learn from a previous human-made classification of requirements according to their quality. Therefore, our work’s intent is not to improve machine learning techniques, but rather to devise a novel application to the field of requirements quality assessment.”
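The two steps from the quote (learn the experts’ implicit quality function, then apply it) can be sketched in Python as follows. Everything here is an assumption made for illustration: the two metrics, the list of vague terms, the example requirements, and the nearest-centroid classifier, which merely stands in for the learning techniques used in the paper.

```python
VAGUE_TERMS = {"fast", "easy", "appropriate", "etc", "user-friendly"}  # assumed list


def metrics(requirement):
    """Two toy metrics: length in words and number of vague terms."""
    words = requirement.lower().replace(",", " ").replace(".", " ").split()
    return (len(words), sum(word in VAGUE_TERMS for word in words))


def centroid(vectors):
    """Component-wise mean of a list of metric vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def learn(labelled):
    """Step 1: learn one centroid per expert label from classified requirements."""
    by_label = {}
    for text, label in labelled:
        by_label.setdefault(label, []).append(metrics(text))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def assess(model, requirement):
    """Step 2: assign the label of the closest centroid (squared distance)."""
    point = metrics(requirement)
    return min(model, key=lambda label: sum(
        (p - c) ** 2 for p, c in zip(point, model[label])))


# Hypothetical expert classification used as training data.
EXPERT_LABELLED = [
    ("The system shall respond to a login request within 2 seconds.", "good"),
    ("The system shall store each order in the database.", "good"),
    ("The system should be fast and easy to use, etc.", "bad"),
    ("The interface must be user-friendly and appropriate.", "bad"),
]

model = learn(EXPERT_LABELLED)
print(assess(model, "The system shall log every failed login attempt."))
print(assess(model, "The page should be fast, easy and appropriate."))
```

A realistic setup would need many more metrics, normalised to comparable scales, and far more expert-labelled requirements than four.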
I strongly recommend reading the paper, as it provides very good methods for working with requirements quality in many modern organisations.
Creating recommendation systems is a tricky task. We need to add the temporal domain to the data. In particular, we need to make sure that we capture what was recommended before to the specific user and how the user reacted to it. We also need to capture the evolution of the users and the data.
In this paper, the authors present a framework, RectoLibry, which helps to construct this kind of system. The framework covers both the development of the recommendations and their deployment.
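The temporal bookkeeping mentioned above (what was recommended to whom, when, and how the user reacted) can be sketched as a simple event log in Python. This has nothing to do with the framework’s actual API; the `RecommendationEvent` class, the reaction labels and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RecommendationEvent:
    user_id: str
    item_id: str
    shown_at: datetime
    reaction: str  # hypothetical labels: "clicked", "ignored", "dismissed"


def history_for(events, user_id, before):
    """Events for one user that happened strictly before `before`, oldest first."""
    selected = [e for e in events
                if e.user_id == user_id and e.shown_at < before]
    return sorted(selected, key=lambda e: e.shown_at)


def candidates(items, events, user_id, now):
    """Drop the items that this user dismissed before `now`."""
    dismissed = {e.item_id for e in history_for(events, user_id, now)
                 if e.reaction == "dismissed"}
    return [item for item in items if item not in dismissed]


EVENTS = [
    RecommendationEvent("u1", "a", datetime(2020, 1, 1), "dismissed"),
    RecommendationEvent("u1", "b", datetime(2020, 1, 5), "dismissed"),
    RecommendationEvent("u2", "c", datetime(2020, 1, 1), "dismissed"),
]

# On 3 January, user u1 has only dismissed item "a" so far.
print(candidates(["a", "b", "c"], EVENTS, "u1", datetime(2020, 1, 3)))
```

Keeping the timestamp in every event is what makes it possible to replay the state of the system at any point in time, for example when evaluating a new recommender on historical data.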
Our research team has been working with code comments for a while now. We have done analyses of source code comments and the code that is commented. We have also worked with the needs for code comments.
The results showed that AI support for the code review process is very much needed. However, they have also shown that the current tools are not good enough yet. One of the tools is DeepCode.ai, which analyzes a code repository and finds problems in the code. It is a great tool, but it has been trained on large data sets from open source projects, which makes it tricky to use on proprietary software. It does not know how to capture the project-specific characteristics.
I recently came across this article (https://link-springer-com.ezproxy.ub.gu.se/content/pdf/10.1007/s10664-019-09730-9.pdf), which is about generating code comments. It can generate code comments (not review comments) for the Java programming language. It “understands” what the code does and generates a natural language description of the code. The tool is based on the latest work from the NLP domain and has been trained on over 44 million statements, which is quite a number 😊
The tool is not perfect (yet), but it shows improvement over existing approaches and is definitely opening up new avenues for using NLP in Software Engineering.
Deep learning models are often designed, trained and tested in Python. It is a language with a nice structure, quite straightforward syntax and a lot of libraries. However, very few tutorials about deep learning (or any Python programming tutorials) discuss the quality of the code, e.g. its modularization, encapsulation and naming consistency.
As a result, a lot of machine learning code written in Python is hard to read and hard to grasp. Even when it is part of Jupyter notebooks, the code is often not really commented.
The study linked above supports my long-held gut feeling about this. The findings show that (from the abstract): First, long lambda expression, long ternary conditional expression, and complex container comprehension smells are frequently found in deep learning projects. That is, deep learning code involves more complex or longer expressions than the traditional code does. Second, the number of code smells increases across the releases of deep learning applications. Third, we found that there is a co-existence between code smells and software bugs in the studied deep learning code, which confirms our conjecture on the degraded code quality of deep learning applications.
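To get a feeling for the three smells named in the abstract, here is a minimal detector sketch based on Python’s standard `ast` module. It is not the tool used in the study, and the thresholds are my own illustrative assumptions, not the values the authors used.

```python
import ast

# Thresholds are illustrative assumptions, not the values used in the study.
MAX_EXPR_CHARS = 40
MAX_COMPREHENSION_CLAUSES = 2

COMPREHENSIONS = (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)


def find_smells(source):
    """Return sorted (line number, smell name) pairs for three expression smells."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Lambda):
            if len(ast.get_source_segment(source, node)) > MAX_EXPR_CHARS:
                smells.append((node.lineno, "long lambda"))
        elif isinstance(node, ast.IfExp):
            if len(ast.get_source_segment(source, node)) > MAX_EXPR_CHARS:
                smells.append((node.lineno, "long ternary conditional"))
        elif isinstance(node, COMPREHENSIONS):
            # Count every "for" clause and every "if" filter in the comprehension.
            clauses = sum(1 + len(gen.ifs) for gen in node.generators)
            if clauses > MAX_COMPREHENSION_CLAUSES:
                smells.append((node.lineno, "complex comprehension"))
    return sorted(smells)


SAMPLE = """\
f = lambda x: x + 1
g = lambda first, second, third: first * second + third - first
h = [x * y for x in range(3) for y in range(3) if x != y]
"""

for line, smell in find_smells(SAMPLE):
    print(line, smell)
```

Running something like this over a notebook export is a cheap way to check whether your own deep learning code shows the same pattern.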