January 2023 – SE metrics (Software Engineering)

Creating your own models

Last week I wrote about our seminar and Co-pilot. I’m sure that has stimulated a lot of thoughts on these language models. Many think that it is difficult to create, train and use them.

Nothing could be further from the truth. If you are interested in training such a model from scratch, I recommend the following book (in particular Chapter 4 if you are eager to get started).

Transformers for Natural Language Processing | Packt (packtpub.com)

The book explains how these models work for natural language processing, but making them work for source code is trivial. Use your code instead of the provided text and there you go. You need a GPU or a cloud service; otherwise training will take forever.

But if you have one, you can get really cool results within a day or two.
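To make the "use your code instead of the provided text" step a bit more concrete, here is a minimal sketch of how one might collect source files into a single plain-text training corpus. It assumes that the training script reads one text file, which is my assumption and not something taken from the book. The function name `build_corpus` and the choice of `.py` files are made up for illustration.

```python
from pathlib import Path


def build_corpus(source_root, corpus_path, suffixes=(".py",)):
    """Concatenate source files under source_root into one text file.

    Hypothetical helper: returns the number of files written to the corpus.
    """
    root = Path(source_root)
    count = 0
    with open(corpus_path, "w", encoding="utf-8") as corpus:
        # Sort the paths so the corpus is reproducible between runs.
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                # Skip files that are not valid UTF-8 text.
                continue
            # Make sure every file ends with exactly one newline.
            corpus.write(text.rstrip("\n") + "\n")
            count += 1
    return count
```

Calling `build_corpus("my_project", "corpus.txt")` (both names are placeholders) would give you a text file that can stand in for the example text when you follow the training steps.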

Good luck!

GitHub Co-pilot and code generation

So, this week’s post is my reflection on the seminar that we hosted last week (the recording is above). It was an eye-opener for me in a few respects.

The first was the question of ownership. Since AI is not a legal subject, it cannot really own anything. I know, AI and computational models are not the same, but for the sake of the argument let’s assume that they are. At the end of the day, it is still a human being who presses the button and generates new source code or comments or what have you. So, the responsibility is still very much on us when we use these tools.

The second was the question of the community and why we have open-source software. We certainly do not publish our source code openly just so that someone else can profit from it. Attribution and recognition are very important (if not the most important) aspects of any open-source community. So, using their code to create commercial models requires at least some attribution. Why not show which code was used to train these models and show how good the communities really are?

Finally, my main point still stands: we should use these models to become better. They make us so much more productive that we should not go back to the old ways of writing software. Providing suggestions and ideas to programmers can make our software better, faster to ship and potentially more reliable.

However, we need to make sure that we change the way we attribute the software. I, for one, will start to add “co-created by GitHub Co-pilot and the OSS communities” to my work when I use the tool. Maybe you can do that too? At least to give some attribution back to our countless colleagues who deserve it…