{"id":1030,"date":"2026-06-18T10:45:50","date_gmt":"2026-06-18T09:45:50","guid":{"rendered":"https:\/\/metrics.blogg.gu.se\/?p=1030"},"modified":"2026-06-05T10:47:31","modified_gmt":"2026-06-05T09:47:31","slug":"can-we-force-llms-to-generate-the-code-we-really-want","status":"publish","type":"post","link":"https:\/\/metrics.blogg.gu.se\/?p=1030","title":{"rendered":"Can we force LLMs to generate the code we really want?"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/viktor_paper.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"395\" src=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/viktor_paper-1024x395.png\" alt=\"\" class=\"wp-image-1031\" srcset=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/viktor_paper-1024x395.png 1024w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/viktor_paper-300x116.png 300w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/viktor_paper-768x297.png 768w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/viktor_paper-1200x463.png 1200w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/viktor_paper.png 1261w\" sizes=\"(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/a><\/figure>\n\n\n\n<p>Experiment design &#8211; from the paper<\/p>\n\n\n\n<p class=\"has-drop-cap\">Large Language Models (LLMs) are revolutionary for programming productivity, producing functional code snippets in seconds. 
However, as software engineers, my co-authors and I know that &#8220;functional&#8221; is not the same as &#8220;well-designed.&#8221; LLMs are generally &#8220;bottom-up&#8221; thinkers; they excel at local syntax but struggle to adhere to higher-level architectural structures or design patterns, which are crucial for long-term software maintainability and scalability.<\/p>\n\n\n\n<p>In our new paper, presented at PROMISE &#8217;26, we set out to answer a critical question: How can we best guide LLMs to incorporate design patterns into their generated code without sacrificing functional correctness?<\/p>\n\n\n\n<p>We decided to use the standard Singleton creational pattern as our case study due to its easily identifiable predicates. We designed a computational experiment evaluating 13 state-of-the-art LLMs (including GPT-4o Mini, Llama 3.3, and Qwen 3) across 164 Java coding challenges from HumanEval-X. We tested four distinct prompting strategies: simple natural language instructions, iterative binary automated feedback (&#8220;Is it Singleton? Yes\/No&#8221;), extensive automated feedback identifying exactly which Singleton properties were missing, and extensive feedback combined with few-shot examples.<\/p>\n\n\n\n<p>Our findings reveal that there is no one-size-fits-all prompting solution; the optimal strategy is highly model-dependent. However, a major takeaway is that even simple strategies work remarkably well. Overall, <strong>iterative binary feedback<\/strong> provided the best balance, maximizing alignment with the Singleton pattern while preserving or even improving the code&#8217;s functionality.<\/p>\n\n\n\n<p>Surprisingly, enforcing design principles didn&#8217;t always hurt performance. 
For a strong model like Llama 3.3, simply instructing it to use Singleton resulted in 100% pattern adherence and actually <em>increased<\/em> functional test pass rates by 34 percentage points compared to the baseline.<\/p>\n\n\n\n<p>Our study shows we <em>can<\/em> teach LLMs good design habits using automated feedback loops. You can read the full paper and access our experimental data here: <a href=\"https:\/\/arxiv.org\/pdf\/2605.26898\">https:\/\/arxiv.org\/pdf\/2605.26898<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Experiment design &#8211; from the paper Large Language Models (LLMs) are revolutionary for programming productivity, producing functional code snippets in seconds. However, as software engineers, my co-authors and I know that &#8220;functional&#8221; is not the same as &#8220;well-designed.&#8221; LLMs are generally &#8220;bottom-up&#8221; thinkers; they excel at local syntax but struggle to adhere to higher-level architectural &hellip; <a href=\"https:\/\/metrics.blogg.gu.se\/?p=1030\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Can we force LLMs to generate the code we really 
want?&#8221;<\/span><\/a><\/p>\n","protected":false},"author":68,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6,4],"tags":[],"_links":{"self":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1030"}],"collection":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/users\/68"}],"replies":[{"embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1030"}],"version-history":[{"count":1,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1030\/revisions"}],"predecessor-version":[{"id":1032,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1030\/revisions\/1032"}],"wp:attachment":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1030"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1030"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1030"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}