
Code generation in all its forms, from solving problems and creating test cases to adversarial testing and fixing security vulnerabilities, is super-popular in contemporary software engineering. It helps engineers be more efficient in their work, and it helps managers get more out of the resources at their disposal.
However, there is a bit of a darker side to it. We can all buy GitHub Copilot or another add-in. We can even get the companies to set up a dedicated instance in their cloud for us. But it 1) costs money, 2) uses a lot of energy, and 3) is probably a security risk, as we need to send our code over an open network.
The alternative is to use publicly available models and create an add-in for the tool to use them. Fully on-site, no information even leaves the premises (https://bit.ly/4iMNgrU). But how good are these models?
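To make the "fully on-site" idea concrete, here is a minimal sketch of querying a locally hosted model, assuming a local Ollama server on its default port; the model name and prompt are illustrative and not taken from the paper:

```python
import json
import urllib.request

# Ask a locally hosted model (here: Phi-4 served by Ollama on
# http://localhost:11434) to generate code. Nothing leaves the machine.
def generate_locally(prompt: str, model: str = "phi4") -> str:
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one complete response instead of a token stream
    }).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

if __name__ == "__main__":
    print(generate_locally("Write a Python function that reverses a string."))
```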
In this paper, the authors studied models in the class that we can run on a modern desktop computer. Yes, we need a GPU for them, but a 4090 or 5090 would be enough. They tested Llama, Gemma 2, Gemma 3, DeepSeek-R1 and Phi-4. They found that the Phi-4 model was a bit worse than OpenAI's o3-mini-high, but it was very close. The o3-mini-high got ca. 87% correctness with pass@3, and Phi-4 achieved 64%. Yes, not exactly the same, but still damn close.
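For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled solutions passes all the tests. The standard unbiased estimator (from Chen et al.'s Codex paper) can be computed as below; the sample numbers are made up for illustration, not taken from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn from n generated solutions of which c are correct,
    passes the tests."""
    if n - c < k:
        return 1.0  # too few incorrect samples: every draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 10 samples per task, 4 of them correct, k = 3.
print(f"pass@3 = {pass_at_k(n=10, c=4, k=3):.2%}")  # pass@3 = 83.33%
```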
I suggest reading the paper to everyone looking into the possibilities of using these models on their own.