The Synthetic Engineer: Measuring the Real Impact of AI on Software Delivery

https://miroslawstaron.github.io/hallucinations.html#/5

The shift from manual coding to AI-augmented orchestration is no longer a future prospect – it is a reality. Software engineers adopt AI increasingly often and increasingly deeply.

However, as organizations pour investment into Generative AI tools, a critical question remains: How do we measure the true return on investment?

I asked Gemini to analyze the DORA report and search the internet for how people measure AI adoption. Its report, Evaluating the Synthetic Engineer, suggests that we must move beyond vanity metrics like “lines of code generated.” When code generation is cheap, we need to think about adoption and design instead.

I recently heard that one company paid Anthropic the equivalent of three software engineers’ worth of tokens for a seven-person team. Effectively, 30% of the entire team (3 out of 3+7) was AI. This is really cool, and it shows that this reality is here. But how do we know that these tokens were not just wasted?
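The 30% figure is simple arithmetic, and it can be sketched as below. This is only an illustration: the function name `ai_share_of_team` and the idea of counting token spend in "engineer equivalents" are my assumptions for the example, not an established metric.

```python
def ai_share_of_team(ai_engineer_equivalents: float, human_engineers: int) -> float:
    """Fraction of total team capacity (humans + AI equivalents) attributed to AI.

    `ai_engineer_equivalents` is the token spend expressed as a number of
    engineer salaries (a hypothetical unit used only for this illustration).
    """
    total = ai_engineer_equivalents + human_engineers
    if total <= 0:
        raise ValueError("team capacity must be positive")
    return ai_engineer_equivalents / total


# The example from the text: three engineers' worth of tokens, seven people.
share = ai_share_of_team(ai_engineer_equivalents=3, human_engineers=7)
print(f"AI share of team capacity: {share:.0%}")  # prints "AI share of team capacity: 30%"
```

The share alone says nothing about value delivered, which is why it needs to be paired with the delivery metrics discussed below.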

The Velocity-Quality Tension

The most immediate effect of AI is a spike in velocity. Teams often see a 15–25% reduction in Cycle Time and significantly accelerated onboarding, with the “Time to 10th PR” dropping from 91 days to just 33.

However, this speed comes with a hidden cost: Comprehension Debt. The report highlights that AI-assisted code often results in higher defect density and a rework rate that can double the human baseline. To manage this, we must align AI metrics with the industry-standard DORA metrics to ensure that speed doesn’t break the system.
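One way to make this tension visible is to look at the velocity change and the rework change side by side. The sketch below is a minimal illustration under my own assumptions: the `PeriodMetrics` structure, the `velocity_quality_check` function, and the 25% rework threshold are hypothetical, and the input numbers are made up to match the ranges mentioned above, not measured data.

```python
from dataclasses import dataclass


@dataclass
class PeriodMetrics:
    cycle_time_days: float  # average cycle time in the period
    rework_rate: float      # fraction of work spent on unplanned fixes (0..1)


def relative_change(before: float, after: float) -> float:
    """Signed relative change; a negative value means the metric went down."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


def velocity_quality_check(baseline: PeriodMetrics, current: PeriodMetrics,
                           max_rework_increase: float = 0.25) -> dict:
    """Compare a period with AI assistance against a human baseline.

    `max_rework_increase` is an assumed threshold, to be tuned per organization.
    """
    cycle = relative_change(baseline.cycle_time_days, current.cycle_time_days)
    rework = relative_change(baseline.rework_rate, current.rework_rate)
    return {
        "cycle_time_change": cycle,
        "rework_rate_change": rework,
        "faster": cycle < 0,
        "quality_at_risk": rework > max_rework_increase,
    }


# Illustrative values only: 20% faster cycle time, but rework has doubled.
baseline = PeriodMetrics(cycle_time_days=10.0, rework_rate=0.10)
current = PeriodMetrics(cycle_time_days=8.0, rework_rate=0.20)
result = velocity_quality_check(baseline, current)
print(result)
```

The same `relative_change` helper applied to the onboarding figures (91 days down to 33) gives a reduction of roughly 64%.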

Integrated Metrics Framework

To truly evaluate the impact of AI, organizations should track a mix of telemetry-based system data and survey-based human sentiment.

| Category | Metric | Measurement Source / Context |
|---|---|---|
| DORA (System) | Deployment Frequency | CI/CD Pipeline / Release logs |
| DORA (System) | Lead Time for Changes | Version Control / Deployment logs |
| DORA (System) | Change Failure Rate | Incident Management / CI/CD logs |
| DORA (System) | Recovery Time (MTTR) | Incident Management / Pager logs |
| AI Use | Acceptance Rate | IDE Plugin Telemetry |
| AI Use | AI Interaction Time | Tool Telemetry / Browser logs |
| AI Effect | Rework Rate | Jira / Commit history |
| Human | Trust & Reliance | Developer Surveys (Confidence in AI) |
| Human | Job Satisfaction | Developer Surveys (Burnout vs. Flow) |

Now, we can compare that to the DORA metrics that are widely used in industry today. They come in two parts. First, the telemetry-based ones:

| Metric | Definition | Measurement Source |
|---|---|---|
| Deployment Frequency | How often the team successfully releases to production. | CI/CD Pipeline / Release logs |
| Lead Time for Changes | Time from code commit to code successfully running in production. | Version Control / Deployment logs |
| Change Failure Rate | % of deployments causing a failure in production (requiring a fix/rollback). | Incident Management / CI/CD logs |
| Failed Deployment Recovery Time | How long it takes to restore service after a failure in production. | Incident Management / Pager logs |
| Rework Rate | The percentage of work time spent on unplanned fixes or bugs. | Ticket tracking (Jira) / Commit history |
| Acceptance Rate | The ratio of AI-generated code suggestions that are actually kept in the file. | IDE Plugin Telemetry |
| Commit/PR Volume | The raw count of code changes and pull requests submitted. | Version Control Systems (VCS) |
| AI Interaction Time | The actual duration of time spent interacting with an AI interface. | Tool Telemetry / Browser logs |
| Code Stability | The frequency of breaks or regressions in the automated test suite. | Testing Frameworks / Build logs |
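To show how the telemetry-based definitions translate into numbers, here is a minimal sketch that derives the four delivery metrics and the acceptance rate from simple records. The `Deployment` record, its field names, and both functions are assumptions made for this illustration; real CI/CD, incident, and IDE telemetry exports will look different and need their own mapping.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it was running in production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four delivery metrics over one observation period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [hours(d.restored_at - d.deployed_at)
                  for d in failures if d.restored_at is not None]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_recovery_time_hours": median(recoveries) if recoveries else None,
    }


def acceptance_rate(suggestions_shown: int, suggestions_kept: int) -> float:
    """Share of AI suggestions that were kept in the file."""
    if suggestions_shown <= 0:
        raise ValueError("no suggestions shown")
    return suggestions_kept / suggestions_shown


# Made-up sample data for one week, for illustration only.
t0 = datetime(2025, 1, 1, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=8),
               failed=True, restored_at=t0 + timedelta(days=1, hours=10)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=2)),
]
metrics = dora_metrics(deployments, period_days=7)
print(metrics)
print(f"Acceptance rate: {acceptance_rate(200, 60):.0%}")
```

I use the median rather than the mean for lead and recovery times, since a single long-running change would otherwise dominate the result; that is a design choice in this sketch, not part of the metric definitions above.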

And then the ones that are measuring perceptions, based on surveys:

| Metric | Definition | Context for Use |
|---|---|---|
| Trust | The degree of confidence a developer has in the accuracy and safety of AI output. | To identify if developers are “blindly” following AI or if skepticism is hindering adoption. |
| Reflexive Use | How instinctively a developer turns to AI when a new problem arises. | To measure the behavioral shift in problem-solving habits. |
| Reliance | The self-assessed level of dependency on AI tools to complete daily work. | To monitor for potential skill atrophy or high-dependency risks. |
| Individual Effectiveness | Perceived productivity, impact on the organization, and ability to stay “in flow.” | To assess the “value-add” from the developer’s own perspective. |
| Job Satisfaction | The level of fulfillment and contentment a developer feels in their role. | To ensure that AI automation is improving work life rather than creating “toil.” |
| Burnout | Physical or mental exhaustion caused by work-related stress. | To monitor if the increased “instability” caused by AI is taxing the team. |
| Personal Ownership | The psychological feeling of “owning” the code and its quality. | To prevent the dilution of accountability when AI generates a high volume of code. |
| User-Centric Focus | The extent to which the team prioritizes end-user needs in their workflow. | Used as a “multiplier” to see if AI speed is being directed at the right goals. |
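Survey-based metrics like these are typically collected on a rating scale and then aggregated per dimension. The sketch below assumes a 1–5 scale and a simple dictionary per respondent; the function name `summarize_survey`, the scale, and the sample answers are all hypothetical and only meant to show the aggregation step.

```python
from collections import defaultdict
from statistics import mean


def summarize_survey(responses: list[dict[str, int]], scale_max: int = 5) -> dict[str, dict]:
    """Aggregate per-metric scores; respondents may skip individual questions."""
    scores: dict[str, list[int]] = defaultdict(list)
    for response in responses:
        for metric, score in response.items():
            if not 1 <= score <= scale_max:
                raise ValueError(f"score {score} for '{metric}' is outside 1..{scale_max}")
            scores[metric].append(score)
    # Report the sample size next to the mean, so thin data is easy to spot.
    return {metric: {"mean": mean(values), "n": len(values)}
            for metric, values in scores.items()}


# Made-up answers from three developers, for illustration only.
responses = [
    {"trust": 4, "reliance": 5, "burnout": 2},
    {"trust": 2, "reliance": 4, "burnout": 3},
    {"trust": 3, "reliance": 5},
]
summary = summarize_survey(responses)
for metric, stats in summary.items():
    print(f"{metric}: mean={stats['mean']:.2f} (n={stats['n']})")
```

Reading dimensions together is usually more informative than reading them alone; for example, high reliance combined with middling trust may be worth a closer look.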

I recommend picking out a few of these metrics and sticking to them. I personally prefer telemetry-based metrics because they provide more value than survey responses. Survey-based metrics should be used sparingly, as they provide more of a temperature reading for an organization.

Author: Miroslaw Staron

I’m a professor in Software Engineering at Computer Science and Engineering. I usually blog about articles I find interesting and my own reflections on the development of Software Engineering, AI, computer science and automotive software.