Miroslaw Staron – SE metrics (Software Engineering)

Measuring AI

How Do You Measure AI? | Communications of the ACM

Due to my background in software metrics, I’ve been interested in the measurement of AI systems for a while. What I found is that there are benchmarks and suites of metrics used for measuring AI. But...

When GPT-5 was announced, most of the metrics that they showed improved by only 1-2 percentage points, from 97% to 99%, which made me wonder whether our models are already that close to perfect or whether we need new ways of measuring generative AI systems.

As I see it, we need new metrics and new benchmarks. I like the “Humanity’s Last Exam” benchmark, because it is still not saturated. It is a great benchmark, but if a model achieves a perfect score on it, will that help us construct good generative AI software? Or will we make software that is very good in theory and not useful in practice?

In this article, the authors offer an opinion on this topic, supporting my view and also indicating that source code generation is one of the areas where metrics are more mature than in others. CodeBLEU and CodeROUGE are better than their non-code counterparts, because they take domain knowledge into consideration.
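To make the “domain knowledge” point concrete, here is a minimal sketch of one idea behind code-aware metrics: giving language keywords more weight than ordinary tokens when counting matches. This is a simplification for illustration and not the CodeBLEU implementation (CodeBLEU also combines standard BLEU with syntax-tree and data-flow matching). The keyword list, the weight of 5 and the function name are assumptions made for this example.

```python
from collections import Counter

# Assumed keyword list for illustration; real metrics use per-language lists.
KEYWORDS = {"def", "return", "if", "else", "for", "while"}


def unigram_precision(candidate: str, reference: str, keyword_weight: float = 1.0) -> float:
    """Clipped unigram precision over whitespace tokens.

    With keyword_weight == 1.0 every token counts the same (plain, BLEU-like).
    With keyword_weight > 1.0 keyword tokens count more (code-aware).
    """
    cand = Counter(candidate.split())
    ref = Counter(reference.split())

    def weight(token: str) -> float:
        return keyword_weight if token in KEYWORDS else 1.0

    # A candidate token only counts as matched as often as it occurs in the reference.
    matched = sum(min(count, ref[tok]) * weight(tok) for tok, count in cand.items())
    total = sum(count * weight(tok) for tok, count in cand.items())
    return matched / total if total else 0.0


reference = "def add ( a , b ) : return a + b"
candidate = "def add ( x , y ) : return x + y"  # same structure, renamed variables

plain = unigram_precision(candidate, reference)
code_aware = unigram_precision(candidate, reference, keyword_weight=5.0)
print(f"plain: {plain:.3f}, keyword-weighted: {code_aware:.3f}")
```

The candidate differs from the reference only in variable names, so the keyword-weighted score is higher than the plain one: the structural tokens that matter for code are all in place.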

Let’s see what new benchmarks will pop up when GPT-5 becomes even more popular.

Software on Demand: from IDEs to Intent

OpenAI’s latest keynote put one idea forward: coding is shifting from writing lines to expressing intent. With GPT-5’s push into agentic workflows—and concrete coding gains on benchmarks like SWE-bench Verified—the “software on demand” era is no longer speculative. You describe behavior; an agent plans, scaffolds, implements, runs tests, and iterates. Humans stay in the loop as product owners and reviewers.

What’s different now isn’t just better autocomplete. OpenAI’s platform updates (Responses API + agent tooling) are standardizing how models call tools, navigate repos, and execute tasks, turning LLMs into reliable collaborators rather than clever chatbots. The keynote storyline mirrored what many teams are seeing: agents that can reason across files, run tests, and honor constraints, and then explain their choices.

There’s still daylight between today’s agents and fully autonomous engineers—OpenAI itself acknowledged the limits—but the arc is clear. In the near term, expect product teams to specify features as executable specs: a prompt plus acceptance tests. Agents draft code; CI catches regressions; humans approve merges. The payoff is faster iteration and broader access: more people can “program” without memorizing frameworks, while specialists curate architecture, performance, and safety. The Guardian

If you’re experimenting, start small: encode user stories as tests, let an agent propose patches, and gate everything behind your normal review. The orgs that win won’t be the ones that replace engineers—they’ll be the ones that instrument intent, tests, and guardrails so agents can ship value on demand.

I’m already past that – experimenting at large with prompts for writing requirements, LLMs using design patterns, and developing add-ins for Visual Studio to make these tools available.

Suggested research & resources

SWE-bench & SWE-bench Verified – Real-world GitHub issue benchmark (plus a human-validated subset) used to measure end-to-end software fixing by LLMs/agents. Great for evaluating “software on demand” claims. arXiv OpenAI
SWE-agent (NeurIPS 2024) – Shows that agent-computer interfaces (file navigation, test execution) dramatically improve automated software engineering. Useful design patterns for your own agents. proceedings.neurips.cc arXiv
AutoDev (Microsoft, 2024) – Framework for autonomous planning/execution over repos, with strong results on code and test generation; a good reference for multi-tool agent loops. arXiv Visual Studio Magazine
OpenAI: New tools for building agents (2025) – Overview of the Responses API and how to wire tools/function-calling for robust agent behavior. OpenAI

GPT-5 – the best and the greatest?

In the last few days, OpenAI announced its newest model. The model seems to be really good. In fact, it is so good that the increase over the previous models is only 1 percentage point in some cases (from 98% to 99%). This means that we need better benchmarks to show how the models differ.

Well, the model is really something that I want to use now, not wait until next week. If this is the latest, I do not know what the next one will be!

Is Quantum the next big thing for the masses?

But what is quantum computing? (Grover’s Algorithm)

If you are a programmer looking into quantum computing, people start “dumbing it down” for you by talking about superpositions and multiple bits in one. Well, that is not entirely true, and it is a misconception.

In this video, the author explains how quantum works, based on the mathematics. Don’t worry, it’s really approachable, without dumbing it down or mansplaining.

Thanks to my colleague who sent me the video!

Do we need a large model to generate good code?

arxiv.org/pdf/2504.07343

Code generation in all its forms, from solving problems to test case creation, adversarial testing and fixing security vulnerabilities, is super-popular in contemporary software engineering. It helps engineers to be more efficient in their work, and it helps managers to get more out of the resources at their disposal.

However, there is a bit of a darker side to it. We can all buy GitHub Copilot or another add-in. We can even get the companies to set up a special instance in their cloud for us. But it 1) costs money, 2) uses a lot of energy, and 3) is probably a security risk, as we need to send our code over an open network.

The alternative is to use publicly available models and create an add-in for the tool to use them. Fully on-site, no information ever leaves the premises (https://bit.ly/4iMNgrU). But how good are these models?

In this paper, the authors studied models in the class that we can run on a modern desktop computer. Yes, we need a GPU for them, but a 4090 or 5090 would be enough. They tested Llama, Gemma 2, Gemma 3, DeepSeek-R1 and Phi-4. They found that the Phi-4 model was a bit worse than OpenAI’s o3-mini-high, but it was very close. The o3-mini-high got ca. 87% correctness with pass@3, and Phi-4 achieved 64%. Yes, not exactly the same, but still damn close.
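For readers who have not met pass@k before: it is the probability that at least one of k generated solutions passes all tests. A common way to compute it is the unbiased estimator from the Codex paper (Chen et al., 2021), which draws n samples per problem, counts the c correct ones, and computes 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator; whether the paper discussed here computes pass@3 in exactly this way is an assumption on my side, and the sample counts in the example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of correct samples,
    k: number of attempts we allow ourselves (1 <= k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: any k draws must contain a correct one.
        return 1.0
    # 1 minus the probability that all k drawn samples are wrong.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up example: 10 samples for a problem, 3 of them pass the tests.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@3 = {pass_at_k(10, 3, 3):.3f}")
```

The benchmark score is then the average of this value over all problems, which is why pass@3 is always at least as high as pass@1 for the same model.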

I suggest that everyone looking into the possibility of using these models on their own read the paper.

Requirements and AI

The last few months took a lot of my energy as I transitioned from administrative duties to more research-oriented ones. Although I like blogging a lot, there was simply no time left for that. Well, I did write, and there will be a new book coming out soon, so here is a preview of what the book will be about.

Not only that, this presentation shows how well we managed to develop a tool that helps one of the Software Center companies to keep its market leadership in standardization and requirements.

Enjoy!

New kids on the block, or are they?

A bit of a different blog post today. I’ve just finished a course that I teach to 2nd year undergraduate students – embedded and real-time software systems. I love to see how my students grow from not knowing anything about C to programming embedded systems with interrupts, serial communication between two Arduinos and using the preprocessor to implement advanced variability.

In this blog post, however, I want to write a bit about the future of software engineering as I see it. Everyone talks about AI and how it will take our jobs and reduce the need for software engineers. It will, no doubt about that. What it will not do is take the jobs of the BEST programmers on the market. If you are a great designer and software engineer, you will be even better, you will take jobs from everyone else.

This will happen only if we engage in competition. We cannot just rely on ChatGPT, DeepSeek or Manus to write our software and texts. We need to be the best programmers with these tools – faster than anyone else, more secure than anyone else and more innovative than anyone else. That means that we need to get closer to our customers. We need to understand them better than they understand themselves, and we need to do it in the ethical way – we cannot treat our customers as products, we need to treat them as people.

The same goes for our stakeholders. In my course, my stakeholders are my head of department, my dean, my boss and my students. The students are the most important ones. I am here to help them grow, and I am privileged when they come to my lectures, but I cannot force them. I need to make sure that I enrich their days, that they feel that my lectures are worth their while. I hope that I deliver: I see that most of them come to the lectures, and most of them are happy.

We must engage in competition a bit more – the best ones must feel that they have deserved it. Otherwise, what’s the point of being the best if everyone else is also the best?

Agents, agents, better agents…

Image by Aberrant Realities from Pixabay

Introducing smolagents: simple agents that write actions in code.

In the work with generative AI, there is a constant temptation to let the AI take over and do most of the jobs. There are even ways to do that in software engineering, for example by linking the code generation with testing.

In this HuggingFace blog, the authors provide a description of an autonomous agent framework that can automate a lot of tasks. They provide a very nice description of the levels at which these agents operate, here is the table, quoted directly from the blog:

Agency Level	Description	How that’s called	Example Pattern
☆☆☆	LLM output has no impact on program flow	Simple processor	`process_llm_output(llm_response)`
★☆☆	LLM output determines basic control flow	Router	`if llm_decision(): path_a() else: path_b()`
★★☆	LLM output determines function execution	Tool call	`run_function(llm_chosen_tool, llm_chosen_args)`
★★★	LLM output controls iteration and program continuation	Multi-step Agent	`while llm_should_continue(): execute_next_step()`
★★★	One agentic workflow can start another agentic workflow	Multi-Agent	`if llm_trigger(): execute_agent()`

Source: HuggingFace
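To get a feeling for the three starred levels in the table, here is a minimal sketch of the Router, Tool call and Multi-step Agent patterns. It does not use the smolagents API: the “LLM” is a scripted stand-in that replays canned replies, and all names (`scripted_llm`, `TOOLS`, `router`, `tool_call`, `multi_step`) are hypothetical ones chosen for this example.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]


def scripted_llm(replies: List[str]) -> LLM:
    """Fake LLM for illustration: ignores the prompt and replays canned replies in order."""
    queue = list(replies)

    def llm(prompt: str) -> str:
        return queue.pop(0)

    return llm


TOOLS: Dict[str, Callable[[int], int]] = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
}


def router(llm: LLM, value: int) -> str:
    # Router: the LLM output picks one of two fixed branches.
    if llm(f"Is {value} large? Answer yes or no.") == "yes":
        return "large"
    return "small"


def tool_call(llm: LLM, value: int) -> int:
    # Tool call: the LLM output decides which function is executed.
    name = llm("Pick a tool: double or increment.")
    return TOOLS[name](value)


def multi_step(llm: LLM, value: int, max_steps: int = 5) -> int:
    # Multi-step agent: the LLM output decides whether the loop continues
    # and which tool runs next; max_steps is a guardrail against endless loops.
    for _ in range(max_steps):
        choice = llm(f"Current value is {value}. Name the next tool or say stop.")
        if choice == "stop":
            break
        value = TOOLS[choice](value)
    return value


print(router(scripted_llm(["yes"]), 100))
print(tool_call(scripted_llm(["double"]), 4))
print(multi_step(scripted_llm(["increment", "double", "stop"]), 1))
```

The higher the level, the more of the control flow is handed to the model, which is why the loop in `multi_step` keeps a hard step limit.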

I like the model and I’ve definitely done levels one and two, maybe parts of level three. With this framework, you can do level three very easily, so I recommend taking a look at it.

Maybe, this will be the topic of the next Hackathon we do at Software Center, who knows… there is one coming up on March 20th.

AI, AI and one more time AI

CES keynote from Nvidia’s CEO

AI has transformed the way we develop software and create new products. It is here to stay and it will just grow bigger. This year, one of the important events is CES, where Nvidia’s CEO shows the latest developments.

Well, no surprise that generative AI is the key. Generating frames, worlds, programs, dialogs, agents, anything basically. The newest GPUs generate 33 million pixels out of 2 million real ones. It’s a tremendous improvement compared to the previous generation (4x improvement).

The coolest announcement is actually not the hardware but software. The world models instead of language models are probably the coolest software part. Being able to tokenize any kind of modality and make the model generative leads to really innovative areas. Generating new driving scenarios, training robots to imitate the best cooks, drivers, artists are only a few of the examples.

And finally – robots, robots and robots. According to the keynote, this is the technology that is on the verge of becoming mainstream. Humanoid robots that allow for brownfield development are the key development here.

Now, the keynote is a bit long, but it’s definitely worth looking at.

Let’s make 2025 an Action Research year!

Image by Haeruman from Pixabay

Guidelines for Conducting Action Research Studies in Software Engineering

Happy 2025! Let’s make it a great year full of fantastic research results and great products. How to achieve that goal? Well, let’s take a look at this paper about guidelines for conducting action research.

These guidelines are based on my experiences of working as a software engineer. I started my career in industry, and even after moving to academia I stayed close to the action – where software gets done. Reflecting on the previous years, I looked at my GitHub profile and realized that only two repositories are used in industry. Both are used by my colleagues from Software Center, who say that this software provided them with new, cool possibilities. I need to create more of this kind of impact in 2025.

Let’s make 2025 an Action research year!