
https://miroslawstaron.github.io/hallucinations.html#/5
The shift from manual coding to AI-augmented orchestration is no longer a future prospect; it is a reality. Software engineers are adopting AI more often and more deeply.
However, as organizations pour investment into Generative AI tools, a critical question remains: How do we measure the true return on investment?
I asked Gemini to analyze the DORA report and search the internet to find out how people measure AI adoption. Its report, Evaluating the Synthetic Engineer, suggests that we must move beyond vanity metrics like “lines of code generated.” When code generation is cheap, we need to think about adoption and design instead.
I’ve recently heard that one company paid the equivalent of three software engineers’ worth of tokens to Anthropic, for a seven-person team. This means that effectively, 30% of the entire team (3 out of 3+7) was AI. This is really cool and it shows that this reality is here. How do we check that these tokens were not just wasted, though?
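The 30% figure is simple arithmetic, but it is worth writing down because the same calculation can be repeated as token spend changes. Below is a minimal sketch; the function name `ai_share_of_team` and the idea of expressing token spend as "engineer equivalents" are my own illustrative assumptions, not something taken from the report.

```python
def ai_share_of_team(human_engineers: int, ai_engineer_equivalents: float) -> float:
    """Return the fraction of the effective team that is AI.

    `ai_engineer_equivalents` is token spend expressed as a number of
    engineer salaries (a hypothetical unit used only for illustration).
    """
    total = human_engineers + ai_engineer_equivalents
    if total <= 0:
        raise ValueError("effective team size must be positive")
    return ai_engineer_equivalents / total


# The example from the text: seven people plus three engineers' worth of tokens.
share = ai_share_of_team(human_engineers=7, ai_engineer_equivalents=3)
print(f"AI share of the effective team: {share:.0%}")
```

For the team above, this gives 3 / (7 + 3), or 30%.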
The Velocity-Quality Tension
The most immediate effect of AI is a spike in velocity. Teams often see a 15–25% reduction in Cycle Time and significantly accelerated onboarding—reducing the “Time to 10th PR” from 91 days to just 33.
However, this speed comes with a hidden cost: Comprehension Debt. The report highlights that AI-assisted code often results in higher defect density and a rework rate that can double the human baseline. To manage this, we must align AI metrics with the industry-standard DORA metrics to ensure that speed doesn’t break the system.
Integrated Metrics Framework
To truly evaluate AI adoption, organizations should track a mix of telemetry-based system data and survey-based human sentiment.
| Category | Metric | Measurement Source / Context |
|---|---|---|
| DORA (System) | Deployment Frequency | CI/CD Pipeline / Release logs |
| DORA (System) | Lead Time for Changes | Version Control / Deployment logs |
| DORA (System) | Change Failure Rate | Incident Management / CI/CD logs |
| DORA (System) | Recovery Time (MTTR) | Incident Management / Pager logs |
| AI Use | Acceptance Rate | IDE Plugin Telemetry |
| AI Use | AI Interaction Time | Tool Telemetry / Browser logs |
| AI Effect | Rework Rate | Jira / Commit history |
| Human | Trust & Reliance | Developer Surveys (Confidence in AI) |
| Human | Job Satisfaction | Developer Surveys (Burnout vs. Flow) |
Now, we can compare that to the DORA metrics that are widely used in industry today. There, we have two parts. The first part covers the telemetry-based ones:
| Metric | Definition | Measurement Source |
|---|---|---|
| Deployment Frequency | How often the team successfully releases to production. | CI/CD Pipeline / Release logs |
| Lead Time for Changes | Time from code commit to code successfully running in production. | Version Control / Deployment logs |
| Change Failure Rate | % of deployments causing a failure in production (requiring a fix/rollback). | Incident Management / CI/CD logs |
| Failed Deployment Recovery Time | How long it takes to restore service after a failure in production. | Incident Management / Pager logs |
| Rework Rate | The percentage of work time spent on unplanned fixes or bugs. | Ticket tracking (Jira) / Commit history |
| Acceptance Rate | The ratio of AI-generated code suggestions that are actually kept in the file. | IDE Plugin Telemetry |
| Commit/PR Volume | The raw count of code changes and pull requests submitted. | Version Control Systems (VCS) |
| AI Interaction Time | The actual duration of time spent interacting with an AI interface. | Tool Telemetry / Browser logs |
| Code Stability | The frequency of breaks or regressions in the automated test suite. | Testing Frameworks / Build logs |
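To make the first four definitions in the table concrete, here is a minimal sketch that computes them from a list of deployment records. The `Deployment` record, its field names, and the choice of medians over a fixed reporting window are assumptions made for illustration; real CI/CD and incident tools expose this data in their own formats, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record joined from version control, CI/CD, and incident logs."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                     # caused a failure in production
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_summary(deployments: list, window_days: int) -> dict:
    """Compute the four telemetry-based DORA metrics for one reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    recovery_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        # None when there was nothing to recover from in this window.
        "median_recovery_hours": median(recovery_hours) if recovery_hours else None,
    }


# Made-up sample data: four deployments in a seven-day window, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 4 * hour),
    Deployment(t0 + day, t0 + day + 8 * hour, failed=True,
               restored_at=t0 + day + 10 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 2 * hour),
]
summary = dora_summary(sample, window_days=7)
print(summary)
```

Comparing such a summary for the periods before and after an AI rollout is one way to see whether the extra velocity shows up as a higher change failure rate.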
The second part covers the metrics that measure perceptions, based on surveys:
| Metric | Definition | Context for Use |
|---|---|---|
| Trust | The degree of confidence a developer has in the accuracy and safety of AI output. | To identify if developers are “blindly” following AI or if skepticism is hindering adoption. |
| Reflexive Use | How instinctively a developer turns to AI when a new problem arises. | To measure the behavioral shift in problem-solving habits. |
| Reliance | The self-assessed level of dependency on AI tools to complete daily work. | To monitor for potential skill atrophy or high-dependency risks. |
| Individual Effectiveness | Perceived productivity, impact on the organization, and ability to stay “in flow.” | To assess the “value-add” from the developer’s own perspective. |
| Job Satisfaction | The level of fulfillment and contentment a developer feels in their role. | To ensure that AI automation is improving work life rather than creating “toil.” |
| Burnout | Physical or mental exhaustion caused by work-related stress. | To monitor if the increased “instability” caused by AI is taxing the team. |
| Personal Ownership | The psychological feeling of “owning” the code and its quality. | To prevent the dilution of accountability when AI generates a high volume of code. |
| User-Centric Focus | The extent to which the team prioritizes end-user needs in their workflow. | Used as a “multiplier” to see if AI speed is being directed at the right goals. |
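Survey metrics like these are usually reported as an aggregate per dimension. The sketch below assumes a 1 to 5 Likert scale and a simple mean per metric; the scale, the function name `summarize_survey`, and the sample answers are all assumptions for illustration, since the report does not prescribe a scoring scheme.

```python
from collections import defaultdict
from statistics import mean


def summarize_survey(responses: list) -> dict:
    """Aggregate survey answers into a mean score and response count per metric.

    Each response maps a metric name to a score on an assumed 1-5 scale.
    Developers may skip questions, so counts can differ per metric.
    """
    scores = defaultdict(list)
    for response in responses:
        for metric, score in response.items():
            if not 1 <= score <= 5:
                raise ValueError(f"score for {metric!r} is outside the 1-5 scale")
            scores[metric].append(score)
    return {
        metric: {"mean": mean(values), "n": len(values)}
        for metric, values in scores.items()
    }


# Made-up answers from three developers; the third skipped the burnout question.
answers = [
    {"trust": 4, "burnout": 2},
    {"trust": 3, "burnout": 4},
    {"trust": 5},
]
result = summarize_survey(answers)
print(result)
```

Keeping the response count next to each mean helps avoid over-reading a score that only a few people answered.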
I recommend picking a few of these metrics and sticking to them. I personally prefer telemetry-based metrics because data collected automatically provides more value than answers from a survey. Survey-based metrics should be used sparingly, as they provide more of a temperature reading for an organization.