Due to my background in software metrics, I’ve been interested in the measurement of AI systems for a while. What I found is that there are benchmarks and suites of metrics used to measure AI. But…
When GPT-5 was announced, most of the metrics they showed improved by only 1–2 percentage points, typically from 97% to 99%. That made me wonder: are the models really that close to perfect, or do we need new ways of measuring generative AI systems?
As I see it, we need new metrics and new benchmarks. I like the Humanity’s Last Exam benchmark, because it is still not saturated. It is a great benchmark, but if a model achieves a perfect score on it, will it be able to construct good software? Or will generative AI give us software that is very good in theory and not useful in practice?
In this article, the authors offer an opinion on this topic, supporting my view and also indicating that source code generation is one of the areas where metrics are maturing faster than others. CodeBLEU and CodeROUGE are better than their non-code counterparts because they take domain knowledge into consideration.
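To make that concrete, here is a minimal sketch of the CodeBLEU idea: the final score is a weighted sum of a plain n-gram component and code-aware components (keyword-weighted n-grams, AST match, data-flow match). The component values and the equal weights below are illustrative placeholders, not output of the reference implementation.

```python
# Illustrative sketch of a CodeBLEU-style composite score.
# The four component scores are assumed to be computed elsewhere
# (e.g., by the reference CodeBLEU tooling); the values below are placeholders.

def codebleu_style_score(ngram_match: float,
                         weighted_ngram_match: float,
                         ast_match: float,
                         dataflow_match: float,
                         weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of BLEU-like and code-aware components."""
    components = (ngram_match, weighted_ngram_match, ast_match, dataflow_match)
    return sum(w * c for w, c in zip(weights, components))

# Example: a candidate that only moderately matches tokens but preserves
# syntax (AST) and semantics (data flow) scores higher than plain BLEU suggests.
print(codebleu_style_score(0.41, 0.45, 0.78, 0.82))  # -> 0.615
```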
Let’s see what new benchmarks will pop up when GPT-5 becomes even more popular.
OpenAI’s latest keynote put one idea forward: coding is shifting from writing lines to expressing intent. With GPT-5’s push into agentic workflows—and concrete coding gains on benchmarks like SWE-bench Verified—the “software on demand” era is no longer speculative. You describe behavior; an agent plans, scaffolds, implements, runs tests, and iterates. Humans stay in the loop as product owners and reviewers.
What’s different now isn’t just better autocomplete. OpenAI’s platform updates (Responses API + agent tooling) are standardizing how models call tools, navigate repos, and execute tasks, turning LLMs into reliable collaborators rather than clever chatbots. The keynote storyline mirrored what many teams are seeing: agents that can reason across files, run tests, and honor constraints, then explain their choices.
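As a rough illustration of what such an agent loop looks like, here is a minimal sketch in Python. The tool set (read a file, run the test suite, apply a patch) is hypothetical, and propose_next_action is a placeholder standing in for the model call; in a real agent that step would be wired through a tool-calling API such as the Responses API.

```python
import subprocess
from pathlib import Path

# Hypothetical tools an agent might be given; real agents expose these
# through the model's tool/function-calling interface.
def read_file(path: str) -> str:
    return Path(path).read_text()

def run_tests() -> str:
    """Run the project's test suite and return the output for the model to read."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.stdout + result.stderr

def apply_patch(path: str, new_content: str) -> str:
    Path(path).write_text(new_content)
    return f"wrote {path}"

def propose_next_action(history):
    """Placeholder for the model call that picks the next tool and its arguments."""
    raise NotImplementedError("wire this to your tool-calling API")

def agent_loop(task: str, max_steps: int = 10):
    history = [("task", task)]
    for _ in range(max_steps):
        action, args = propose_next_action(history)   # e.g. ("run_tests", {})
        if action == "done":
            return history
        tool = {"read_file": read_file, "run_tests": run_tests,
                "apply_patch": apply_patch}[action]
        history.append((action, tool(**args)))
    return history
```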
There’s still daylight between today’s agents and fully autonomous engineers (OpenAI itself acknowledged the limits), but the arc is clear. In the near term, expect product teams to specify features as executable specs: a prompt plus acceptance tests. Agents draft code; CI catches regressions; humans approve merges. The payoff is faster iteration and broader access: more people can “program” without memorizing frameworks, while specialists curate architecture, performance, and safety. (The Guardian)
If you’re experimenting, start small: encode user stories as tests, let an agent propose patches, and gate everything behind your normal review. The orgs that win won’t be the ones that replace engineers—they’ll be the ones that instrument intent, tests, and guardrails so agents can ship value on demand.
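Concretely, “encode user stories as tests” can be as simple as a pytest acceptance test that the agent’s patch must make pass. The module and function names here (invoicing, apply_discount) are made up for the example; the agent would be asked to implement them.

```python
# Acceptance test for a hypothetical user story:
# "As a customer, I get 10% off orders above 100 EUR."
# The module `invoicing` and function `apply_discount` are placeholders
# that the agent is expected to implement.
import pytest
from invoicing import apply_discount

def test_discount_applied_above_threshold():
    assert apply_discount(total=120.0) == pytest.approx(108.0)

def test_no_discount_at_or_below_threshold():
    assert apply_discount(total=100.0) == pytest.approx(100.0)
```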
I’m already past that: experimenting at scale with writing requirements as prompts, having LLMs use design patterns, and developing add-ins for Visual Studio to make these tools available.
Suggested research & resources
SWE-bench & SWE-bench Verified – Real-world GitHub issue benchmark (plus a human-validated subset) used to measure end-to-end software fixing by LLMs/agents. Great for evaluating “software on demand” claims; a quick way to inspect it is sketched after this list. (arXiv; OpenAI)
SWE-agent (NeurIPS 2024) – Shows that agent-computer interfaces (file navigation, test execution) dramatically improve automated software engineering. Useful design patterns for your own agents. (proceedings.neurips.cc; arXiv)
AutoDev (Microsoft, 2024) – Framework for autonomous planning/execution over repos, with strong results on code and test generation; a good reference for multi-tool agent loops. (arXiv; Visual Studio Magazine)
OpenAI: New tools for building agents (2025) – Overview of the Responses API and how to wire tools/function-calling for robust agent behavior. (OpenAI)
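If you want to look at the SWE-bench Verified tasks yourself, the instances are published as a dataset. The snippet below is a sketch assuming the Hugging Face datasets library and the princeton-nlp/SWE-bench_Verified dataset id; check the official SWE-bench repository for the actual evaluation harness.

```python
# Inspect SWE-bench Verified task instances.
# Assumes `pip install datasets` and that the dataset is published
# on the Hugging Face Hub as princeton-nlp/SWE-bench_Verified.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds), "task instances")

example = ds[0]
print(example["repo"])                      # source repository of the issue
print(example["problem_statement"][:300])   # the GitHub issue text the agent sees
```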
In the last few days, OpenAI announced their newest model. The model seems to be really good. In fact, it is so good that the improvement over the previous models is only about 1 percentage point in some cases (from 98% to 99%). This means that we need better benchmarks to show how the models differ.
Well, the model is really something that I want to use now, not next week. If this is the latest one, I do not know what the next one will be!