{"id":1035,"date":"2026-07-03T10:50:12","date_gmt":"2026-07-03T09:50:12","guid":{"rendered":"https:\/\/metrics.blogg.gu.se\/?p=1035"},"modified":"2026-06-05T10:54:14","modified_gmt":"2026-06-05T09:54:14","slug":"what-are-you-talking-about-one-agent-asked-another","status":"publish","type":"post","link":"https:\/\/metrics.blogg.gu.se\/?p=1035","title":{"rendered":"What are you talking about &#8211; one agent asked another&#8230;"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/srijita_paper.png\"><img loading=\"lazy\" decoding=\"async\" width=\"585\" height=\"419\" src=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/srijita_paper.png\" alt=\"\" class=\"wp-image-1036\" srcset=\"https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/srijita_paper.png 585w, https:\/\/metrics.blogg.gu.se\/files\/2026\/06\/srijita_paper-300x215.png 300w\" sizes=\"(max-width: 585px) 85vw, 585px\" \/><\/a><\/figure>\n\n\n\n<p>Image taken directly from the paper<\/p>\n\n\n\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2605.24138\">https:\/\/arxiv.org\/pdf\/2605.24138<\/a><\/p>\n\n\n\n<p class=\"has-drop-cap\">The Software Engineering (SE) landscape is shifting from LLM-assisted workflows, like copilots, toward Autonomous SE, where multiple specialized AI agents cooperate without a human in the loop. The premise is exciting: a &#8216;Designer&#8217; agent creates the plan, and a &#8216;Programmer&#8217; agent implements it. Yet, simply letting agents talk to each other does not reliably lead to correct or stable solutions. In our new paper, my colleagues and I undertake a systematic analysis to understand why.<\/p>\n\n\n\n<p>We explored conversations between a Designer and a Programmer across 12 combinations from 7 leading open-source models\u2014including Gemma 2\/3, LLaMA 3.2\/3.3, Qwen3, and the reasoning-focused DeepSeek-R1\u2014as they tried to build a mathematical game in C (Fibonacci). 
We found that the interactions are complex, non-linear, and prone to surprising failures.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Echo Chambers Instead of Collaboration<\/h3>\n\n\n\n<p>One of our most important findings is that metrics commonly used to measure conversational &#8220;success,&#8221; such as BLEU and ROUGE scores (which measure n-gram overlap between texts and are often read as a proxy for semantic alignment), can be misleading. In mismatched pairs, particularly those pairing non-reasoning models (like Gemma 3 or MiniCPM) with a reasoning model (DeepSeek-R1), high scores were a red flag for &#8220;semantic echoing.&#8221; The Programmer agent simply mirrored the Designer\u2019s output verbatim, which was a conversational failure, not a collaborative victory.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">DeepSeek-R1: The Lone Convergent Pair<\/h3>\n\n\n\n<p>In terms of actual solution correctness, the results were stark. Only one agent pair, <strong>DeepSeek-R1 paired with itself<\/strong>, converged immediately to the correct solution and sustained it through the final iteration. This indicates that while reasoning capabilities are crucial, stable collaboration currently depends more on consistent role conditioning. Our manual inspections showed that DeepSeek-R1:DeepSeek-R1 prioritized design discussion over echoing, which contributed to its success despite some &#8220;No Code Found&#8221; instances, often related to compilation instructions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Talking Themselves Out of Success: The Threat of Drift<\/h3>\n\n\n\n<p>We also identified two critical patterns we call &#8220;behavioral stagnation&#8221; and &#8220;drift.&#8221; Several promising pairs, including Qwen3:DeepSeek-R1, DeepSeek-R1:LLaMA 3.3, and even a same-model pair, LLaMA 3.3:LLaMA 3.3, actually <strong>started<\/strong> with the correct solution. 
However, they subsequently <em>talked themselves out of it<\/em>, diverging to other topics (such as related number theory or other code snippets) and never converging again.<\/p>\n\n\n\n<p>As we noted in our analysis, late recovery from this kind of drift is unlikely. This provides an essential behavioral signal for SE tool developers: you must monitor the health of the interaction trace (for repetition, topic drift, or role instability) in real time, rather than relying solely on whether code is eventually produced. Monitoring these conversational patterns can inform early stopping conditions or trigger prompt revisions before computational time is wasted on non-productive exchanges.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Takeaway<\/h3>\n\n\n\n<p>As Software Engineering transitions to autonomous agent teams, understanding and calibrating these multi-agent interaction dynamics is critical. Strong semantic alignment does not ensure correctness, and reasoning capability alone does not guarantee stable collaboration. You need clear role separation, pair compatibility, and robust monitors that can detect conversational drift. Success isn&#8217;t a final code snippet; it&#8217;s a healthy conversation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Image taken directly from the paper https:\/\/arxiv.org\/pdf\/2605.24138 The Software Engineering (SE) landscape is shifting from LLM-assisted workflows, like copilots, toward Autonomous SE, where multiple specialized AI agents cooperate without a human in the loop. The premise is exciting: a &#8216;Designer&#8217; agent creates the plan, and a &#8216;Programmer&#8217; agent implements it. 
Yet, simply letting agents talk &hellip; <a href=\"https:\/\/metrics.blogg.gu.se\/?p=1035\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;What are you talking about &#8211; one agent asked another&#8230;&#8221;<\/span><\/a><\/p>\n","protected":false},"author":68,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6,7,4,5],"tags":[],"_links":{"self":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1035"}],"collection":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/users\/68"}],"replies":[{"embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1035"}],"version-history":[{"count":1,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1035\/revisions"}],"predecessor-version":[{"id":1037,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=\/wp\/v2\/posts\/1035\/revisions\/1037"}],"wp:attachment":[{"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1035"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1035"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/metrics.blogg.gu.se\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1035"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}