Software on Demand: From IDEs to Intent

OpenAI’s latest keynote put one idea forward: coding is shifting from writing lines to expressing intent. With GPT-5’s push into agentic workflows—and concrete coding gains on benchmarks like SWE-bench Verified—the “software on demand” era is no longer speculative. You describe behavior; an agent plans, scaffolds, implements, runs tests, and iterates. Humans stay in the loop as product owners and reviewers.

What’s different now isn’t just better autocomplete. OpenAI’s platform updates (Responses API + agent tooling) are standardizing how models call tools, navigate repos, and execute tasks, turning LLMs into reliable collaborators rather than clever chatbots. The keynote storyline mirrored what many teams are seeing: agents that can reason across files, run tests, and honor constraints—then explain their choices.
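
To make this concrete, here is a minimal sketch of wiring one tool into the Responses API with the Python SDK. The run_tests tool, its schema, and the prompt are invented for illustration, and exact field names may vary between SDK versions, so treat this as an assumption-laden outline rather than a reference implementation.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical tool: lets the model ask the harness to run the test suite.
    tools = [{
        "type": "function",
        "name": "run_tests",
        "description": "Run the project's test suite and return a pass/fail summary.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory containing the tests."},
            },
            "required": ["path"],
        },
    }]

    response = client.responses.create(
        model="gpt-5",
        input="Fix the failing login test in the auth module, then re-run the suite.",
        tools=tools,
    )

    # Function calls come back as structured output items; your harness executes
    # them and feeds the results back in a follow-up request.
    for item in response.output:
        print(item.type)

The point is less the specific fields than the loop: the model requests a tool, your harness runs it, and the result flows back as input for the next turn.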

There’s still daylight between today’s agents and fully autonomous engineers—OpenAI itself acknowledged the limits—but the arc is clear. In the near term, expect product teams to specify features as executable specs: a prompt plus acceptance tests. Agents draft code; CI catches regressions; humans approve merges. The payoff is faster iteration and broader access: more people can “program” without memorizing frameworks, while specialists curate architecture, performance, and safety. (The Guardian)
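
To illustrate what “a prompt plus acceptance tests” can look like in practice, here is a small pytest sketch that encodes a user story for a hypothetical apply_discount function in an imagined shop.pricing module; the names and the discount rule are invented, not taken from any real project.

    # User story: "As a shopper, I get 10% off orders over $100, and my total never goes negative."
    # apply_discount and shop.pricing are hypothetical; the agent is asked to make these tests pass.
    import pytest
    from shop.pricing import apply_discount

    def test_orders_over_threshold_get_ten_percent_off():
        assert apply_discount(200.00) == pytest.approx(180.00)

    def test_orders_at_or_below_threshold_are_unchanged():
        assert apply_discount(100.00) == pytest.approx(100.00)

    def test_total_never_goes_negative():
        assert apply_discount(0.00) >= 0.0

The prompt describes the feature in plain language; the tests pin down the behavior precisely enough for CI to judge the agent’s patch.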

If you’re experimenting, start small: encode user stories as tests, let an agent propose patches, and gate everything behind your normal review. The orgs that win won’t be the ones that replace engineers—they’ll be the ones that instrument intent, tests, and guardrails so agents can ship value on demand.
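
As a rough sketch of that gate, the snippet below applies an agent-proposed patch on a scratch branch, runs the test suite, and only forwards green patches to human review. The patch file name, branch name, and commands are assumptions; a real pipeline would live in CI rather than a local script.

    import subprocess

    def gate_agent_patch(patch_file: str, branch: str = "agent/proposal") -> bool:
        """Apply an agent-proposed patch on a branch and run tests before any human sees it."""
        subprocess.run(["git", "checkout", "-b", branch], check=True)
        subprocess.run(["git", "apply", patch_file], check=True)
        # Run the normal test suite; a non-zero exit code rejects the patch.
        result = subprocess.run(["pytest", "-q"])
        if result.returncode != 0:
            print("Tests failed; patch is not forwarded for review.")
            return False
        print("Tests passed; open a pull request for human review.")
        return True

    if __name__ == "__main__":
        gate_agent_patch("agent_patch.diff")  # hypothetical patch emitted by the agent

Keeping the gate outside the agent, in your existing test and review machinery, is what lets you adopt these tools without lowering your bar for what gets merged.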

I’m already past that: experimenting broadly with prompts that capture requirements, LLMs that apply design patterns, and Visual Studio add-ins that make these tools available.


Suggested research & resources

  • SWE-bench & SWE-bench Verified – Real-world GitHub issue benchmark (plus a human-validated subset) used to measure end-to-end software fixing by LLMs/agents. Great for evaluating “software on demand” claims. arXivOpenAI
  • SWE-agent (NeurIPS 2024) – Shows that agent-computer interfaces (file navigation, test execution) dramatically improve automated software engineering. Useful design patterns for your own agents. proceedings.neurips.ccarXiv
  • AutoDev (Microsoft, 2024) – Framework for autonomous planning/execution over repos, with strong results on code and test generation; a good reference for multi-tool agent loops. arXivVisual Studio Magazine
  • OpenAI: New tools for building agents (2025) – Overview of the Responses API and how to wire tools/function-calling for robust agent behavior. OpenAI

GPT-5 – the best and the greatest?

In the last few days, OpenAI announced its newest model, and it looks genuinely strong. In fact, it is so good that in some cases the improvement over previous models is only a single percentage point (from 98% to 99%), because the existing benchmarks are nearly saturated. That means we need better benchmarks to show how the models actually differ.

Well, this is a model I want to use now, not next week. And if this is the latest, I can hardly imagine what the next one will be!