50 Language/Code models, let’s talk…

BIld av pencil parker från Pixabay

As you have probably observed I’ve been into language models for code analysis, design and recognition. It’s a great way of spending your research time as it gives you the possibility to understand how we program and understand how to model that. In my personal case, this is a great complement to the empirical software engineering research that I do otherwise.

In the recent time I got a feeling that I look into more and more of these models, all of them baring certain similarity to the Google’s BERT model or the Fracebook’s TransCoder. So I set off to do a short review of the papers that actually talk about code models or, as they are often called, programming language models. I started from the paper describing CodeBERT ( [2002.08155] CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arxiv.org) ) and looked at the 500 citations that the model has. The list below is just the list of the models that are created based on CodeBERT. There are also models created based on AlphaGo or Github CoPilot, but I leave these for another occasion.

I must admit that I did not read all of these papers and did not look at all of these models. Far from it, I only looked at some of them. My conclusion is that we have a lot of models, but the quality of the results vary a lot. The best models provide good results in ca. 20% of cases. AlphaCode is an example of such a model, which is fantastic, but not super-accurate all the time. As the model is used for super-competitive tasks, 20% is actually very impressive – it’s difficult to say that I would do better for these programming competitions, so I’m not criticizing.

The best model I’ve seen so far, however, is the Github CoPilot, which is by far the best model to create code that the world has seen. Well, there may be models that the world has not seen, but then they do not count. If you would like to see a preview of how I use it (part I), you can take a look at this video:

I sincerely hope that you find this list useful and that you can help me to keep it updated – drop me an e-mail about the list if you want to:

  1. AlphaGo: https://www.deepmind.com/blog/competitive-programming-with-alphacode
  2. TransCoder: https://github.com/facebookresearch/TransCoder
  3. CodeT5: https://arxiv.org/pdf/2109.00859 
  4. CodeITT5: https://arxiv.org/pdf/2208.05446 
  5. ProphetNet: https://arxiv.org/pdf/2104.08006 
  6. Cotex: https://arxiv.org/pdf/2105.08645 
  7. Commit2vec: https://arxiv.org/pdf/1911.07605 
  8. CoreGen: https://www.sciencedirect.com/science/article/pii/S092523122100792X  
  9. SyncoBERT: https://arxiv.org/pdf/2108.04556 
  10. TreeBERT: https://proceedings.mlr.press/v161/jiang21a/jiang21a.pdf 
  11. FastSpec: https://ieeexplore.ieee.org/iel7/9581154/9581061/09581258.pdf 
  12. CVEFixes: https://dl.acm.org/doi/pdf/10.1145/3475960.3475985 
  13. CodeNet: https://arxiv.org/pdf/2105.12655
  14. Graph4Code: https://www.researchgate.net/profile/Jamie-Mccusker-2/publication/339445570_Graph4Code_A_Machine_Interpretable_Knowledge_Graph_for_Code/links/5fd2a29a45851568d154cfaa/Graph4Code-A-Machine-Interpretable-Knowledge-Graph-for-Code.pdf 
  15. DeGraphCE: https://dl.acm.org/doi/pdf/10.1145/3546066 
  16. VELVET: https://ieeexplore.ieee.org/iel7/9825713/9825693/09825786.pdf
  17. Code2Vec: https://uwspace.uwaterloo.ca/bitstream/handle/10012/15862/Arumugam_Lakshmanan.pdf?sequence=9&isAllowed=y 
  18. MulCode: https://ieeexplore.ieee.org/iel7/9425868/9425874/09426045.pdf 
  19. Flakify: https://ieeexplore.ieee.org/iel7/32/4359463/09866550.pdf 
  20. CoDesc: https://arxiv.org/pdf/2105.14220 
  21. NatGen: https://arxiv.org/pdf/2206.07585 
  22. Coctail: https://arxiv.org/pdf/2106.05345 
  23. MergeBERT: https://arxiv.org/pdf/2109.00084 
  24. SPTCode: https://dl.acm.org/doi/pdf/10.1145/3510003.3510096 
  25. InCoder: https://arxiv.org/pdf/2204.05999 
  26. JavaBERT: https://ieeexplore.ieee.org/iel7/9680270/9679822/09680322.pdf 
  27. BERT2Code: https://arxiv.org/pdf/2104.08017 
  28. NeuralCC: https://arxiv.org/pdf/2012.03225 
  29. LineVD: https://arxiv.org/pdf/2203.05181 
  30. GraphCode2Vec: https://arxiv.org/pdf/2112.01218 
  31. ASTBERT: https://arxiv.org/pdf/2201.07984 
  32. CodeRL: https://arxiv.org/pdf/2207.01780 
  33. CV4Code: https://arxiv.org/pdf/2205.08585 
  34. NaturalCC: https://xcodemind.github.io/papers/icse22_naturalcc_camera_submitted.pdf 
  35. StructCode: https://arxiv.org/pdf/2206.05239   
  36. VulBERT: https://arxiv.org/pdf/2205.12424 
  37. CodeMVP: https://arxiv.org/pdf/2205.02029 
  38. miBERT: https://ieeexplore.ieee.org/iel7/9787917/9787918/09787973.pdf?casa_token=rPNbu-k9Gh4AAAAA:3lkZVyUjnDP4Sp1UmmO9eVftsRaf1zAuw1YhHQogsyDBE2Y7992gBlhPb9jKVcI-5Q8tTv2JEyQ 
  39. LineVUL: https://www.researchgate.net/profile/Chakkrit-Tantithamthavorn/publication/359402890_LineVul_A_Transformer-based_Line-Level_Vulnerability_Prediction/links/623ee3d48068956f3c4cbede/LineVul-A-Transformer-based-Line-Level-Vulnerability-Prediction.pdf 
  40. CommitBART: https://arxiv.org/pdf/2208.08100 
  41. GAPGen: https://arxiv.org/pdf/2201.08810 
  42. El-CodeBERT: https://dl.acm.org/doi/pdf/10.1145/3545258.3545260?casa_token=DNyXQpkP69MAAAAA:y2iJC3RliEh7yJ6SzRpRRKrzPn2Q6w25vpm5vpoN0TksDh_SbmVfa_8JcDxvVN8FydOL_vTJqH-6OA 
  43. COCLUBERT: https://ieeexplore.ieee.org/iel7/9679834/9679948/09680081.pdf?casa_token=FtrqlHTmm74AAAAA:kkMyRsMl9xqPQQSBTRd6vFD-2-DyVSomYBYqm8u8aKs7B0_rkYYfL_OLVmOHgzn1-vqMF6W7pM8 
  44. Xcode: https://dl.acm.org/doi/pdf/10.1145/3506696?casa_token=5H8iW3e2GlYAAAAA:m2QA-DXSk5LZYazFxDPEVfLZcYREqDomXNg5YmkR-rPllHD37Qd8eLw_SCu6rbhNHZJ2Od24dvJt_Q 
  45. CobolBERT: https://arxiv.org/pdf/2201.09448 
  46. SiamBERT: https://melqkiades.github.io/files/download/papers/siambert-sais-2022.pdf 
  47. CodeReviewer: https://arxiv.org/pdf/2203.09095 
  48. CodeBERT-nt: https://arxiv.org/pdf/2208.06042 
  49. BashExplainer: https://arxiv.org/pdf/2206.13325

Author: Miroslaw Staron

I’m professor in Software Engineering at IT faculty. I usually blog about interesting articles (for me) and my own reflections on the development of Software Engineering, AI, computer science and automotive software.