
Image generated by Gemini based on the content of this post
https://arxiv.org/pdf/2605.19102
Getting Large Language Models (LLMs) to write functional code often feels like casting spells: a slight misphrasing in your prompt can result in buggy output. That sensitivity matters even more now that we have agents that work on our tasks for days.
The core issue is that while LLMs are powerful, their code generation performance is highly sensitive to prompt formulation. Manual prompt engineering is tedious, and existing automated techniques often treat different kinds of prompt modification, such as lexical edits and semantic rewriting, in isolation. They also typically rely on binary (pass/fail) signals, ignoring valuable information about partial correctness.
When I was at VECS, I got to meet the Swedish champion in prompting. He told me that the best technique is to use LLMs to create prompts. This paper embraces that idea and goes even further, building a full reinforcement learning (RL) framework for producing prompts.
In the paper, an RL agent that transforms the prompt is guided by shaped rewards derived from unit-test feedback. Instead of rewarding only full passes, the system provides a denser learning signal by rewarding the proportion of test cases passed. This lets the agent discover sequences of transformations that progressively improve the functional correctness of the generated code.
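To make the difference between a binary signal and a shaped one concrete, here is a minimal sketch in Python. It is not the paper's implementation: the helper names (`run_tests`, `binary_reward`, `shaped_reward`) and the toy `buggy_abs` task are my own assumptions, and the paper's actual reward may include terms beyond the plain pass fraction.

```python
# Hypothetical sketch: binary vs. shaped reward from unit-test feedback.
# All names here are illustrative assumptions, not taken from the paper.

def run_tests(candidate, test_cases):
    """Count how many (args, expected) cases the candidate function passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash simply counts as a failed test
    return passed


def binary_reward(passed, total):
    """1.0 only if every test passes, otherwise 0.0."""
    return 1.0 if total > 0 and passed == total else 0.0


def shaped_reward(passed, total):
    """Proportion of test cases passed, a denser signal in [0, 1]."""
    return passed / total if total else 0.0


# Toy "generated code": an abs() that forgets to handle negative numbers.
def buggy_abs(x):
    return x


TESTS = [((3,), 3), ((0,), 0), ((-2,), 2), ((-5,), 5)]

passed = run_tests(buggy_abs, TESTS)
print(binary_reward(passed, len(TESTS)))  # 0.0: no credit for being half right
print(shaped_reward(passed, len(TESTS)))  # 0.5: partial correctness is visible
```

With the binary reward, a prompt that gets the model halfway to a solution looks exactly as bad as one that produces garbage. The shaped version gives the agent something to climb.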
The framework was evaluated on three widely used benchmarks (MBPP+, HumanEval+, APPS) with three code generators: CodeT5+, CodeLLaMA, and DeepSeek-Coder. On the MBPP+ test set (500 tasks), the PPO agent achieved strict Pass@1 scores of:
- 57.58% for CodeT5+
- 64.80% for CodeLLaMA
- 85.50% for DeepSeek-Coder
These results significantly outperformed direct generation and existing iterative strategies like EPIC and Reflexion. Furthermore, comparison against a “Random-Hybrid” baseline confirmed that the gains aren’t just from having the transformation tools, but from the agent learning how to intelligently schedule them based on feedback.
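For readers less familiar with the metric behind these numbers, here is a small sketch of how a strict Pass@1 score can be computed. I am assuming that "strict" means a task counts as solved only when its single generated solution passes all of its tests; the paper's evaluation harness may differ in details. The function name and the sample data are made up for illustration.

```python
# Hypothetical sketch of strict Pass@1: one generation per task,
# and a task counts only if that generation passes every test.

def strict_pass_at_1(results):
    """results: list of (tests_passed, tests_total), one entry per task."""
    if not results:
        return 0.0
    solved = sum(1 for passed, total in results if total > 0 and passed == total)
    return solved / len(results)


# Four made-up tasks: two fully solved, one nearly solved, one failed.
example = [(4, 4), (3, 4), (0, 5), (5, 5)]
print(f"{strict_pass_at_1(example):.2%}")  # 50.00%
```

Note that the nearly solved task (3 of 4 tests) contributes nothing here. The metric used for reporting stays strict even though the training signal rewards partial progress.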
The key takeaway is clear: feedback-driven, multi-step RL optimization can move code generation beyond manual prompt engineering, providing an adaptive, automated path to functionally correct code.







