The Synthetic Engineer: Measuring the Real Impact of AI on Software Delivery

https://miroslawstaron.github.io/hallucinations.html#/5

The shift from manual coding to AI-augmented orchestration is no longer a future prospect; it is a reality. Software engineers are adopting AI more often and more deeply.

However, as organizations pour investment into Generative AI tools, a critical question remains: How do we measure the true return on investment?

I asked Gemini to analyze the DORA report and search the internet for how people measure AI adoption. Its report, Evaluating the Synthetic Engineer, suggests that we must move beyond vanity metrics like “lines of code generated.” When code generation is cheap, we need to think about adoption and design instead.

I’ve recently heard that one company paid the equivalent of three software engineers’ worth of tokens to Anthropic, for a seven-person team. This means that, effectively, 30% of the entire team (3 out of 3+7) was AI. This is really cool, and it shows that this reality is here. But how do we check that these tokens were not just wasted?
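The 30% figure is simple arithmetic, but it is worth writing down because the same ratio can be tracked over time. Here is a minimal sketch; the function name ai_share and the idea of expressing the token bill in "engineer-equivalents" are my own assumptions for illustration, not something taken from the company in question.

```python
# Illustrative only: the numbers come from the anecdote above, the
# function name and the "engineer-equivalent" unit are assumptions.
human_engineers = 7
ai_engineer_equivalents = 3  # token bill divided by the cost of one engineer


def ai_share(humans: int, ai_equivalents: float) -> float:
    """Fraction of total team capacity (humans + AI) attributed to AI."""
    return ai_equivalents / (humans + ai_equivalents)


share = ai_share(human_engineers, ai_engineer_equivalents)
print(f"AI share of team: {share:.0%}")  # prints "AI share of team: 30%"
```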

The Velocity-Quality Tension

The most immediate effect of AI is a spike in velocity. Teams often see a 15–25% reduction in Cycle Time and significantly accelerated onboarding, with the “Time to 10th PR” dropping from 91 days to just 33.

However, this speed comes with a hidden cost: Comprehension Debt. The report highlights that AI-assisted code often results in higher defect density and a rework rate that can double the human baseline. To manage this, we must align AI metrics with the industry-standard DORA metrics to ensure that speed doesn’t break the system.
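To make the rework comparison concrete, here is a small sketch that computes the rework rate separately for AI-assisted and non-assisted work. It assumes rework rate means the share of logged hours spent on unplanned fixes, and that each work item carries an ai_assisted flag (for example from a ticket label); both the WorkItem structure and the sample numbers are hypothetical and chosen only to illustrate the calculation.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    hours: float         # time logged on the item
    unplanned_fix: bool  # True for bug fixes / rework, False for planned work
    ai_assisted: bool    # hypothetical flag, e.g. derived from a ticket label


def rework_rate(items: list[WorkItem]) -> float:
    """Share of logged hours spent on unplanned fixes (0.0 if no hours)."""
    total = sum(i.hours for i in items)
    rework = sum(i.hours for i in items if i.unplanned_fix)
    return rework / total if total else 0.0


# Made-up sample data, not measurements.
items = [
    WorkItem(32, False, True), WorkItem(8, True, True),
    WorkItem(36, False, False), WorkItem(4, True, False),
]
ai = rework_rate([i for i in items if i.ai_assisted])
baseline = rework_rate([i for i in items if not i.ai_assisted])
print(f"AI-assisted: {ai:.0%}, baseline: {baseline:.0%}")
```

Tracking the two rates side by side, rather than a single team-wide number, is what lets you see whether the velocity gain is being paid back as rework.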

Integrated Metrics Framework

To truly evaluate the impact of AI, organizations should track a mix of telemetry-based system data and survey-based human sentiment.

Category	Metric	Measurement Source / Context
DORA (System)	Deployment Frequency	CI/CD Pipeline / Release logs
DORA (System)	Lead Time for Changes	Version Control / Deployment logs
DORA (System)	Change Failure Rate	Incident Management / CI/CD logs
DORA (System)	Recovery Time (MTTR)	Incident Management / Pager logs
AI Use	Acceptance Rate	IDE Plugin Telemetry
AI Use	AI Interaction Time	Tool Telemetry / Browser logs
AI Effect	Rework Rate	Jira / Commit history
Human	Trust & Reliance	Developer Surveys (Confidence in AI)
Human	Job Satisfaction	Developer Surveys (Burnout vs. Flow)
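As an example of the "AI Use" row, the acceptance rate can be computed from IDE plugin telemetry. The sketch below assumes each suggestion is logged with two flags, accepted and retained; real plugins expose different fields, so treat the Suggestion structure as hypothetical. I count a suggestion as accepted only if it is still in the file at commit time.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    accepted: bool  # developer accepted the suggestion in the IDE
    retained: bool  # suggestion is still in the file at commit time


def acceptance_rate(suggestions: list[Suggestion]) -> float:
    """Kept suggestions divided by all suggestions shown (0.0 if none)."""
    if not suggestions:
        return 0.0
    kept = sum(1 for s in suggestions if s.accepted and s.retained)
    return kept / len(suggestions)


# Made-up sample: 4 suggestions shown, 2 accepted, 1 of them later removed.
log = [
    Suggestion(True, True),
    Suggestion(True, False),
    Suggestion(False, False),
    Suggestion(False, False),
]
print(f"Acceptance rate: {acceptance_rate(log):.0%}")  # prints "Acceptance rate: 25%"
```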

Now we can compare that to the DORA metrics that are widely used in industry today. They come in two parts. First, the telemetry-based ones:

Metric	Definition	Measurement Source
Deployment Frequency	How often the team successfully releases to production.	CI/CD Pipeline / Release logs
Lead Time for Changes	Time from code commit to code successfully running in production.	Version Control / Deployment logs
Change Failure Rate	% of deployments causing a failure in production (requiring a fix/rollback).	Incident Management / CI/CD logs
Failed Deployment Recovery Time	How long it takes to restore service after a failure in production.	Incident Management / Pager logs
Rework Rate	The percentage of work time spent on unplanned fixes or bugs.	Ticket tracking (Jira) / Commit history
Acceptance Rate	The ratio of AI-generated code suggestions that are actually kept in the file.	IDE Plugin Telemetry
Commit/PR Volume	The raw count of code changes and pull requests submitted.	Version Control Systems (VCS)
AI Interaction Time	The actual duration of time spent interacting with an AI interface.	Tool Telemetry / Browser logs
Code Stability	The frequency of breaks or regressions in the automated test suite.	Testing Frameworks / Build logs
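The first four rows of this table can be derived from deployment and incident records. Below is a minimal sketch under stated assumptions: the Deployment record, its field names, and the choice of medians and hours are mine, and a real pipeline would pull these timestamps from CI/CD and incident management tools rather than from a hand-written list.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # commit time of the change
    deployed_at: datetime                   # change running in production
    failed: bool = False                    # caused a failure in production
    restored_at: Optional[datetime] = None  # service restored, if it failed


def hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Four DORA metrics over a reporting period of `period_days` days."""
    lead = [hours(d.committed_at, d.deployed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    recovery = [hours(d.deployed_at, d.restored_at)
                for d in failures if d.restored_at is not None]
    n = len(deployments)
    return {
        "deployments_per_day": n / period_days,
        "median_lead_time_h": median(lead) if lead else None,
        "change_failure_rate": len(failures) / n if n else None,
        "median_recovery_time_h": median(recovery) if recovery else None,
    }


# Made-up sample: four deployments in one week, one of which failed.
sample = [
    Deployment(datetime(2025, 1, 1, 9), datetime(2025, 1, 1, 13)),
    Deployment(datetime(2025, 1, 2, 9), datetime(2025, 1, 2, 15),
               failed=True, restored_at=datetime(2025, 1, 2, 16)),
    Deployment(datetime(2025, 1, 3, 10), datetime(2025, 1, 3, 12)),
    Deployment(datetime(2025, 1, 5, 9), datetime(2025, 1, 5, 17)),
]
metrics = dora_metrics(sample, period_days=7)
for name, value in metrics.items():
    print(name, value)
```

I use medians rather than means here so that a single long-running change or outage does not dominate the number; that is a design choice, not part of the metric definitions.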

And then the ones that are measuring perceptions, based on surveys:

Metric	Definition	Context for Use
Trust	The degree of confidence a developer has in the accuracy and safety of AI output.	To identify if developers are “blindly” following AI or if skepticism is hindering adoption.
Reflexive Use	How instinctively a developer turns to AI when a new problem arises.	To measure the behavioral shift in problem-solving habits.
Reliance	The self-assessed level of dependency on AI tools to complete daily work.	To monitor for potential skill atrophy or high-dependency risks.
Individual Effectiveness	Perceived productivity, impact on the organization, and ability to stay “in flow.”	To assess the “value-add” from the developer’s own perspective.
Job Satisfaction	The level of fulfillment and contentment a developer feels in their role.	To ensure that AI automation is improving work life rather than creating “toil.”
Burnout	Physical or mental exhaustion caused by work-related stress.	To monitor if the increased “instability” caused by AI is taxing the team.
Personal Ownership	The psychological feeling of “owning” the code and its quality.	To prevent the dilution of accountability when AI generates a high volume of code.
User-Centric Focus	The extent to which the team prioritizes end-user needs in their workflow.	Used as a “multiplier” to see if AI speed is being directed at the right goals.
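Survey metrics like these are usually collected on a rating scale and then aggregated per metric. Here is a small sketch of that aggregation, assuming a 1 to 5 Likert scale and a simple mean; the scale, the metric keys, and the responses are all hypothetical.

```python
from statistics import mean

# Made-up responses on an assumed 1-5 scale, one dict per developer.
responses = [
    {"trust": 4, "reliance": 5, "burnout": 2},
    {"trust": 3, "reliance": 4, "burnout": 3},
    {"trust": 2, "reliance": 4, "burnout": 4},
]


def survey_summary(answers: list[dict]) -> dict:
    """Mean score per metric, skipping developers who left a metric blank."""
    names = sorted({name for a in answers for name in a})
    return {n: mean(a[n] for a in answers if n in a) for n in names}


summary = survey_summary(responses)
for name, score in summary.items():
    print(f"{name}: {score:.2f}")
```

A mean hides disagreement within the team, so it can be worth looking at the spread of answers as well before drawing conclusions.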

I recommend picking out some of these metrics and sticking to them. I personally prefer telemetry-based metrics because they provide more value than filling out a survey. Survey-based metrics should be used sparingly, as they provide more of a temperature reading for an organization.

Author: Miroslaw Staron

I’m a professor in Software Engineering at Computer Science and Engineering. I usually blog about articles that I find interesting and about my own reflections on the development of Software Engineering, AI, computer science and automotive software.