Deep learning in software engineering has been used extensively and there is a significant body of research about this topic. In this post, I would like to share my review of the recent systematic review on the use of DL in SE.
The interesting finding is the list of data sources for data for DL. Here, the source code data is prevalent. This is not surprising as we have GitHub with millions of repositories. The second largest is the repository metadata, again, for the same reason.
Although it is not surprising, it is really good to see this. I see it as a change in the research focus in the last 10 years. It shifted from the research on bugs and bug reports to the research on source code. I’m happy because helping out with the source code is the real improvement of the product, not an improvement of the process.
Another interesting finding is the use of natural language techniques as the most common ones, here I cite the paper: “Our analysis found that, while a number of different data pre-processing techniques have been utilized, tokenization and neural embeddings are by far the two most prevalent. We also found that data-preprocessing is tightly coupled to the DL model utilized, and that the SE task and publication venue were often strongly associated with specific types of pre-processing techniques.“
I recommend to read the article to get more insight into DL models used, which are quite many – from the standard cNN to more advanced GANs and AutoEncoders. Really nice!
Finally, the paper ends with a recommendation on how to use DL in other contexts, kind of flowchart. I do not want to copy it here, so I recommend to take a look at it in the paper: https://arxiv.org/pdf/2009.06520.pdf