
How Do You Measure AI?
Due to my background in software metrics, I have been interested in the measurement of AI systems for a while. What I found is that there are benchmarks and suites of metrics used to measure AI. But…
When GPT-5 was announced, most of the metrics shown improved by only 1-2%, from roughly 97% to 99%. That made me wonder: are these systems really that close to perfect, or do we need new ways of measuring generative AI systems?
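To make the saturation point concrete, here is a toy calculation (the 97% and 99% figures are illustrative round numbers, not scores from any specific benchmark). Near the top of the scale, a small absolute gain is a large relative cut in errors, yet the benchmark has almost no headroom left to separate future models:

```python
# Toy illustration: interpreting a 2-point gain near the top of the scale.
old_score, new_score = 0.97, 0.99

print(f"absolute gain: {new_score - old_score:.2f}")                 # 0.02
print(f"error rate:    {1 - old_score:.2f} -> {1 - new_score:.2f}")  # 0.03 -> 0.01
print(f"headroom left: {1 - new_score:.2f}")                         # 0.01

# The error rate drops threefold, but only one point of headroom remains:
# the benchmark can no longer distinguish the next generation of models,
# which is exactly what saturation means.
```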
As I see it, we need new metrics and new benchmarks. I like the “Humanity’s Last Exam” benchmark because it is not yet saturated. It is a great benchmark, but even if a model achieves a perfect score on it, will that model be able to construct good generative AI software? Or will we end up making software that is very good in theory and not useful in practice?
In this article, the authors offer an opinion on this topic, supporting my view and also indicating that source code generation is one of the areas where metrics are maturing faster than others. CodeBLEU and CodeROUGE are better than their non-code counterparts, BLEU and ROUGE, because they take domain knowledge, such as code syntax and data flow, into consideration.
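For readers unfamiliar with CodeBLEU, the sketch below shows its core idea as described in the CodeBLEU paper (Ren et al., 2020): the final score is a weighted sum of a surface n-gram match and code-aware components (keyword-weighted n-grams, syntax-tree match, data-flow match). The component scores are assumed to be computed by separate matchers elsewhere; only the combination step is shown.

```python
# A minimal sketch of the CodeBLEU idea (Ren et al., 2020), not the
# reference implementation. Each component is a score in [0, 1].

def codebleu(ngram_match: float,          # plain BLEU on token n-grams
             weighted_ngram_match: float, # BLEU with language keywords upweighted
             ast_match: float,            # matched subtrees of the syntax trees
             dataflow_match: float,       # matched def-use edges (semantics)
             weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted combination of surface and code-aware similarity."""
    a, b, g, d = weights
    return (a * ngram_match + b * weighted_ngram_match
            + g * ast_match + d * dataflow_match)

# Same token overlap, different structure: the code-aware components
# separate the two candidates where plain BLEU alone could not.
print(codebleu(0.60, 0.62, 0.90, 0.85))  # structurally close to the reference
print(codebleu(0.60, 0.62, 0.40, 0.30))  # same tokens, broken structure
```

Two candidate snippets with identical token overlap can thus receive very different scores: the one whose syntax tree and data flow match the reference wins, which is exactly the domain knowledge that plain BLEU ignores.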
Let’s see what new benchmarks pop up as GPT-5 becomes even more popular.