{"id":1024,"date":"2026-06-05T10:34:23","date_gmt":"2026-06-05T09:34:23","guid":{"rendered":"https:\/\/metrics.blogg.gu.se\/?p=1024"},"modified":"2026-06-05T10:48:15","modified_gmt":"2026-06-05T09:48:15","slug":"junior-architects-with-shaky-logic-testing-ais-real-world-coding-skills-article-review","status":"publish","type":"post","link":"https:\/\/metrics.blogg.gu.se\/?p=1024","title":{"rendered":"Junior Architects with Shaky Logic: Testing AI\u2019s Real-World Coding Skills &#8211; article review"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"354\" src=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects-1024x354.jpg\" alt=\"\" class=\"wp-image-1025\" srcset=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects-1024x354.jpg 1024w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects-300x104.jpg 300w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects-768x266.jpg 768w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects-1536x531.jpg 1536w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects-1200x415.jpg 1200w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects-1320x457.jpg 1320w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/junior_architects.jpg 1758w\" sizes=\"(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/a><\/figure>\n\n\n\n<p>Image generated by Gemini based on the blog post content<\/p>\n\n\n\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2604.23340\">https:\/\/arxiv.org\/pdf\/2604.23340<\/a><\/p>\n\n\n\n<p class=\"has-drop-cap\">We have all seen Large Language Models (LLMs) write impressive snippets of code or debug a tricky function. 
AI coding editors like GitHub Copilot are increasingly adopted, with studies suggesting that up to 88% of developers report increased productivity.<\/p>\n\n\n\n<p>But accelerations in development come with trade-offs. Existing studies have shown that LLMs often misuse APIs, introduce security vulnerabilities, and hallucinate. So I started to wonder: <em>Can an LLM actually understand the soul of a complex software project? Can it generate a fully automated, high-quality commit (patch) that satisfies requirements and can be directly incorporated into a major production codebase?<\/em><\/p>\n\n\n\n<p>This paper puts the question to the test using actual commits from substantial, real-world open-source systems. The authors developed an automated framework to assess how well LLMs fix bugs and add new features in sizable codebases. They applied this framework to 212 actual commits across eight popular open-source projects, including <strong>FFmpeg<\/strong> and <strong>wolfSSL<\/strong>, and three LLMs: <strong>GPT-4o<\/strong>, <strong>Ministral3-14B<\/strong>, and <strong>Qwen3-Coder-30B<\/strong>.<\/p>\n\n\n\n<p>The framework tested the generated patches on three levels:<\/p>\n\n\n\n<ol start=\"1\">\n<li><strong>Verification:<\/strong> Does the generated code compile?<\/li>\n\n\n\n<li><strong>Validation (Static Analysis):<\/strong> Does it pass Clang\u2019s static analysis checkers (e.g., memory safety checks)?<\/li>\n\n\n\n<li><strong>Validation (Dynamic Testing):<\/strong> Does it pass the project\u2019s existing test suite?<\/li>\n<\/ol>\n\n\n\n<p>The success rate varied wildly, from 0% on certain projects up to 60% on others. But overall, the verdict was clear: LLMs are not yet at a point where they can be effective contributors to production code. They still hallucinate, and they still have serious limitations, at least the models tested here; we&#8217;ll see what the newest ones can do. 
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Takeaway for Architects and Developers<\/h2>\n\n\n\n<p>The bottom line is clear: Do not trust LLMs to write critical production-system code from scratch. They are effective for small functions, feature improvements, routine algorithms, and tasks similar to those seen in their training data.<\/p>\n\n\n\n<p>However, the risk of &#8220;silent failures,&#8221; new security vulnerabilities, and logic regressions means that rigorous human validation remains the most important step when integrating AI-generated contributions. That still holds, even in 2026!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Image generated by Gemini based on the blog post content https:\/\/arxiv.org\/pdf\/2604.23340 We have all seen Large Language Models (LLMs) write impressive snippets of code or debug a tricky function. AI coding editors like GitHub Copilot are increasingly adopted, with studies suggesting that up to 88% of developers report increased productivity. But accelerations in development come &hellip; <a href=\"https:\/\/metrics.blogg.gu.se\/?p=1024\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Junior Architects with Shaky Logic: Testing AI\u2019s Real-World Coding Skills &#8211; article 
review&#8221;<\/span><\/a><\/p>\n","protected":false},"author":68,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"_links":{"self":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1024"}],"collection":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/users\/68"}],"replies":[{"embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1024"}],"version-history":[{"count":2,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1024\/revisions"}],"predecessor-version":[{"id":1034,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1024\/revisions\/1034"}],"wp:attachment":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1024"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1024"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1024"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}