Understanding programming languages is an important topic in research on programming language models. I’ve written before that there are ca. 50 programming language models that we can use in software engineering. Admittedly, they are not all equivalent and each is specific to its task, but they are available, so we can use and customize them.
Now, whether 50 models is a lot or not is debatable. Compared to natural language models, this is a small number. Even compared to the number of programming languages, it is not impressive. But how many languages are widely used? Perhaps 10-15: Java, C, C++, Python, JavaScript, Rust, Go, and their derivatives are the most common ones.
This article is a study done by our colleagues from the department. It’s too long to quote in detail, but there are a few things that I like. First, it’s a good overview of the types of language models:
- Token-based representation: when the program code is basically a sequence of tokens/words; some can have a special meaning, but they are just words (I’ve written about this before and even worked with it in py-ccflex, the Python Flexible Code Classifier: GitHub – mochodek/py-ccflex)
- Tree-based representation: when the program code is seen from the perspective of its abstract syntax tree (AST); an example is the code2vec model
- Graph-based models: when the program code is seen as a directed graph, e.g., a control flow graph
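To make the three views a bit more concrete, here is a minimal sketch in Python that uses only the standard library. It is my own illustration, not code from the paper: the function names and the sample snippet are made up, and the "graph" is simply the list of parent-to-child edges taken from the AST, which is far cruder than a real control flow graph.

```python
import ast
import io
import tokenize

# Hypothetical sample program used only for illustration.
SOURCE = "def add(a, b):\n    return a + b\n"

# Layout-only tokens that carry no "word" for a token-based model.
LAYOUT_TOKENS = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_view(source):
    """Token-based view: the code as a flat sequence of token strings."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in LAYOUT_TOKENS]


def tree_view(source):
    """Tree-based view: the node types of the abstract syntax tree."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def graph_view(source):
    """Graph-based view (simplified): directed parent -> child edges.

    Assumption: AST edges stand in for a graph here; a real model
    would more likely use control flow or data flow edges.
    """
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


if __name__ == "__main__":
    print(token_view(SOURCE))
    print(tree_view(SOURCE))
    print(graph_view(SOURCE))
```

The same two lines of code look very different in each view: the token list keeps the words but loses the structure, the tree keeps the nesting, and the edge list makes the relations between elements explicit.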
Although I like this classification, I think it misses one of the most prominent and popular types: the NLP-based model. It is a type of model where the program code is seen as a set of sentences that carry meaning of some sort. It is a derivative of the token-based representation, but it is much more than that. Codex from OpenAI is an example of such a model.
Nevertheless, this study provides a very interesting set of examples of models and their applications. I sincerely recommend taking a look at this paper to understand how the models work and where they are best used.