Measuring before tooling: where AI investment in engineering should actually go

Posted 14 May 2026 · 5 min read


DX recently published results from a 16-month longitudinal study across 400+ engineering organizations: AI tool usage rose by 65%, while median PR throughput rose by just under 8%. Most organizations landed in the 5-15% range. Meaningful, but well short of the 3x or 10x being promised by some vendors.

Brian Houck, who leads developer productivity research at Microsoft, offered an explanation: coding accounts for around 14% of a developer's workweek. Even significant gains in the coding portion only recover so much.

The other 86% is review, planning, debugging, context-switching, stakeholder back-and-forth, and the rework that happens when something gets built before it's understood. None of that gets faster just because writing the code did.

This isn't an argument against coding tools; I use them daily. But "roll out coding tools" isn't a strategy, and I've recently moved into a new area of focus at Mintel: optimising how engineering teams adopt AI tools. If my job is to help teams move faster, I need to know which tools are helping which teams, where the real bottlenecks are, and how to target investment at problems teams actually have. That's a measurement problem before it's a tooling problem.

DORA tells you what, not why

Most teams already have data that could tell them something. DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore) are a reasonable signal, but having metrics and understanding what they mean are different things.

Two years ago I built an internal dashboard pulling deployment frequency and lead time from our repository activity using the GitLab API. Our engineering managers use it, and it's given us a more grounded basis for conversations than Jira cycle time alone. But its limits became obvious quickly: it can tell you lead time has slowed, but it can't tell you why. I've spoken to EMs who've taken to manually exporting metrics into spreadsheets to spot patterns, a clear signal that the data is there but the insight isn't.
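
For concreteness, this is roughly the shape of the GitLab API queries a dashboard like that sits on top of. It's a minimal sketch, not the dashboard itself: the project path, the 30-day window, and the MR created-to-merged proxy for lead time are illustrative assumptions.

```python
import os
from datetime import datetime, timedelta, timezone

import requests

GITLAB = "https://gitlab.com/api/v4"  # self-hosted instances work the same way
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}
PROJECT = requests.utils.quote("my-group/my-service", safe="")  # hypothetical project path

since = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()

# Deployment frequency: successful production deployments in the window.
deployments = requests.get(
    f"{GITLAB}/projects/{PROJECT}/deployments",
    headers=HEADERS,
    params={"environment": "production", "status": "success",
            "updated_after": since, "per_page": 100},
).json()
print(f"Deployments in the last 30 days: {len(deployments)}")

# Lead time for changes, approximated here as MR created-to-merged time.
merge_requests = requests.get(
    f"{GITLAB}/projects/{PROJECT}/merge_requests",
    headers=HEADERS,
    params={"state": "merged", "updated_after": since, "per_page": 100},
).json()

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

lead_times = sorted(
    (parse(mr["merged_at"]) - parse(mr["created_at"])).total_seconds() / 3600
    for mr in merge_requests
)
if lead_times:
    print(f"Median MR lead time: {lead_times[len(lead_times) // 2]:.1f} hours")
```

Those two numbers are exactly the kind of chart that prompts a question without answering it, which is the gap the rest of this post is about.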

Without measurement we're working from vibes. With measurement but without context we're working from charts.

A more grounded retrospective

To bridge that gap I've been building an agent to help bring metrics into the retro discussion, combining the quantitative and the qualitative to understand the why. It pulls sprint metrics from Jira and GitLab (the current sprint and the two previous) and feeds them into an LLM, with tools to look up individual tickets and merge requests. The output is a set of hypotheses and discussion points for a team's sprint retrospective: patterns the data suggests, questions worth asking, threads worth pulling on.
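
The overall shape is something like the sketch below. It's illustrative rather than the production code: the function names, the SprintSummary fields, and the call_llm stub stand in for the real Jira/GitLab plumbing and the LLM tool-calling loop.

```python
from dataclasses import dataclass

@dataclass
class SprintSummary:
    name: str
    completed_points: int
    median_cycle_time_days: float
    ticket_keys: list[str]          # Jira keys the agent can look up on demand
    merge_request_iids: list[int]   # GitLab MR IIDs, likewise

def lookup_ticket(key: str) -> dict:
    """Tool: fetch one ticket's status history, links and comments (stubbed here)."""
    return {"key": key, "status_history": [], "linked_mrs": []}

def lookup_merge_request(iid: int) -> dict:
    """Tool: fetch one MR's size, review timeline and discussion (stubbed here)."""
    return {"iid": iid, "files_changed": 0, "time_in_review_hours": 0.0}

PROMPT = """You are preparing discussion points for a sprint retrospective.
You are given metrics for the current sprint and the two previous sprints.
Use the lookup tools on individual tickets or merge requests where the
aggregate numbers suggest something worth investigating. Output hypotheses
and questions for the team, not conclusions."""

def call_llm(prompt: str, context: str, tools: list) -> str:
    # Stand-in for the real LLM call and its tool-calling loop.
    return f"(hypotheses generated from:\n{context})"

def prepare_retro(sprints: list[SprintSummary]) -> str:
    context = "\n".join(
        f"{s.name}: {s.completed_points} pts completed, "
        f"median cycle time {s.median_cycle_time_days:.1f}d, "
        f"{len(s.ticket_keys)} tickets, {len(s.merge_request_iids)} MRs"
        for s in sprints
    )
    return call_llm(PROMPT, context, tools=[lookup_ticket, lookup_merge_request])
```

The important design choice is that the model sees aggregates first and only drills into individual tickets or MRs when a number looks worth explaining, which keeps the output anchored to things the team can actually check.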

It isn't automating the retrospective. A retrospective is a team reflecting on its own experience, and that process has value because it's human. It builds shared understanding, surfaces things that don't show up in metrics, and creates the psychological safety that makes honest conversation possible. An algorithm can't replace that.

What it can do is take a question and start pulling on the thread. A few examples from real sprints:

  • Cycle time spiked. The agent walks through how higher cycle time tickets actually moved through the workflow and surfaces that a lot of them bounced in and out of "blocked" more than once. Not "cycle time is up," but "cycle time is up, and a lot of those tickets sat in blocked repeatedly - was there an external dependency we didn't account for in planning?" (There's a sketch of this check after the list.)
  • One ticket dragged on for three weeks. The agent follows the linked MRs and finds the work fanned out across three services. Not "this was a slow ticket," but "this ticket touched three services - is the change boundary in the right place, or are we paying a coupling tax that's going to keep showing up?"
  • An MR sat in review for a week. The agent looks at the diff and finds 40+ files changed. Not "review is slow," but "this MR was probably too large to review well - is there a pattern of changes being bundled that should be split?"
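
To make the first of those concrete, here's the kind of check behind it: counting how many times a ticket entered a blocked status, using the Jira issue changelog. The Jira base URL, credentials, issue keys, and the exact status name are assumptions about your setup, and a real version would handle changelog pagination.

```python
import os
import requests

JIRA = "https://your-company.atlassian.net"  # hypothetical Jira Cloud instance
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])  # email + API token

def times_entered_status(key: str, status: str = "Blocked") -> int:
    """Count transitions into a given status from the issue's changelog."""
    issue = requests.get(
        f"{JIRA}/rest/api/2/issue/{key}",
        auth=AUTH,
        params={"expand": "changelog", "fields": "summary"},
    ).json()
    count = 0
    for history in issue["changelog"]["histories"]:
        for item in history["items"]:
            if item["field"] == "status" and item["toString"] == status:
                count += 1
    return count

# Tickets that bounced through "Blocked" more than once become a retro question,
# not just a cycle-time number.
for key in ["PLAT-101", "PLAT-102"]:  # hypothetical keys from the sprint
    n = times_entered_status(key)
    if n > 1:
        print(f"{key} entered Blocked {n} times - external dependency missed in planning?")
```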

These are still hypotheses. The team has to evaluate them, push back, decide which ones ring true. The goal is better questions backed by data at the start of a human-led conversation, where engineers with the full context can reflect on what's really going on.

Agents beyond authoring

Almost every conversation about AI in engineering defaults to writing code. The DX data suggests the productivity gain is real but bounded. If the other 86% of the work is where the time actually goes, that's where there's room for productivity gains.

A retrospective agent isn't writing code. It's helping a team see where their time is going and ask sharper questions about it. If it works, it's a small concrete demonstration that the most interesting agent use cases might not look anything like coding agents.


This is the first in a series. The next post covers how the agent actually works: data sources, prompting approach, and where it gets tricky. After that: how we rolled it out, and what we've learned from real usage.

