There's a stat floating around that AI coding assistants make developers "55% faster." Another one says "40% more code." A third claims "3x productivity." They get dropped into pitch decks and blog posts without much scrutiny, and everyone nods along because the conclusion feels right — AI tools do feel helpful when you're using them.
But when we actually tried to measure AI's impact on real codebases — looking at commit histories, PR patterns, contributor behaviour, and code quality signals — we ran into a wall. Not because AI isn't changing how people write code. It clearly is. But because measuring how much it changes, and whether the change is for the better, turns out to be a genuinely hard problem that most people are hand-waving past.
Here's what we found.
The Obvious Metric Is the Wrong One
The first instinct is to measure output. More lines of code, more commits, more PRs merged per week. If a developer adopted Copilot in March and their commit frequency doubled by April, that's the AI working, right?
Not necessarily.
Lines of code has been a discredited productivity metric for decades, and AI doesn't magically rehabilitate it. If anything, AI makes it worse. A developer who lets an LLM generate a 200-line utility function they would have written in 40 lines hasn't become 5x more productive — they've added 160 lines of future maintenance burden. The codebase got bigger. Whether it got better is a completely separate question.
We looked at repositories where teams had visibly adopted AI tooling (detectable through commit message patterns, PR descriptions referencing AI assistance, and characteristic code patterns). Commit volume did tend to increase. But so did several less flattering metrics:
- Average commit size grew. More lines changed per commit, which correlates with harder-to-review changes and higher defect rates.
- PR cycle times didn't improve — and sometimes got worse. Reviewers were spending longer on PRs, likely because the code was less familiar (it wasn't written in the author's usual style) or because there was simply more of it to review.
- Commit message quality dropped. More generic descriptions. More "add feature" and "update logic" without context. When the code is generated, developers seem less inclined to explain why.
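The first two of these signals are straightforward to compute once commits are extracted. A minimal sketch, using hypothetical commit records (real data would come from `git log --numstat` or a forge API; the generic-message list is an illustrative assumption, not an exhaustive one):

```python
from dataclasses import dataclass

# Hypothetical commit record; real values would be parsed from `git log --numstat`.
@dataclass
class Commit:
    message: str
    lines_changed: int

# Illustrative set of low-context messages; a real analysis would use a broader heuristic.
GENERIC_MESSAGES = {"add feature", "update logic", "fix", "update", "wip"}

def commit_signals(commits):
    """Return (average lines changed per commit, fraction of generic messages)."""
    if not commits:
        return 0.0, 0.0
    avg_size = sum(c.lines_changed for c in commits) / len(commits)
    generic = sum(1 for c in commits if c.message.strip().lower() in GENERIC_MESSAGES)
    return avg_size, generic / len(commits)

before = [Commit("handle null user ids in session lookup", 40),
          Commit("refactor retry backoff to use jitter", 60)]
after = [Commit("add feature", 240),
         Commit("update logic", 180)]

print(commit_signals(before))  # (50.0, 0.0)
print(commit_signals(after))   # (210.0, 1.0)
```

Tracked month over month, a rising average commit size alongside a rising generic-message fraction is exactly the pattern described above.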
None of this means AI is making things worse. It means that raw output metrics don't capture what's actually happening, and optimising for them steers you toward the wrong goals.
The Before/After Problem
The cleanest way to measure AI impact would be a controlled experiment: same team, same project, same time pressure, with and without AI tools. But nobody works that way. You can't rewind a codebase. You can't un-know what Copilot would have suggested.
What you can do is compare time periods — before adoption and after. But this introduces a cascade of confounding variables:
- The project itself changed. Maybe the team shipped a major release and moved into a maintenance phase. Maybe they onboarded three new contributors. Maybe the architecture shifted from monolith to microservices. Any of these would change your metrics regardless of AI.
- The team changed. People join, leave, get promoted, switch focus. The humans are a bigger variable than the tools.
- Tooling changed. AI adoption rarely happens in isolation. Teams that adopt Copilot are often simultaneously upgrading their CI, migrating to a new framework, or adopting TypeScript. Good luck isolating the AI signal from the noise.
- Novelty effects. Developers who just got a new tool use it enthusiastically. They generate more code because generating code is now fun. That enthusiasm fades. The sustained impact looks different from the initial spike.
We found that almost every "AI made us X% faster" claim, when you actually inspect the methodology, is comparing a period of enthusiastic adoption against a baseline that wasn't controlled for any of the above. It's not lying, exactly. But it's not science either.
What AI Actually Changes (That You Can Measure)
If gross output metrics are misleading and before/after comparisons are confounded, what can you actually observe?
After looking at hundreds of repositories, we think the measurable impacts fall into a few categories — and they're more nuanced than the headlines suggest.
Code homogeneity increases
AI-assisted codebases tend to become more internally consistent over time. The same patterns get repeated. The same abstractions get used. This is partly good — consistency reduces cognitive load. But it also means the codebase can develop a kind of monoculture. The same mistakes get replicated everywhere. The same suboptimal patterns become load-bearing before anyone notices.
You can detect this by looking at file similarity metrics and the ratio of novel patterns versus repeated structures across commits. We've seen repos where 60%+ of new code in a month is structurally near-identical to existing code — a level of repetition that's unusual in purely human-authored codebases.
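One crude way to approximate structural similarity is to normalise away identifiers and literals so only the code's shape remains, then compare the normalised text. This is a sketch, not how any particular tool does it; a production system would compare ASTs or token shingles rather than regex-normalised strings:

```python
import difflib
import re

def normalize(code: str) -> str:
    """Reduce code to its structural shape by masking names and literals.
    A crude stand-in for real AST-based structural comparison."""
    code = re.sub(r"\b[a-zA-Z_]\w*\b", "ID", code)   # identifiers and keywords
    code = re.sub(r"\b\d+\b", "N", code)             # numeric literals
    code = re.sub(r"'[^']*'|\"[^\"]*\"", "S", code)  # string literals
    return code

def structural_similarity(a: str, b: str) -> float:
    """1.0 means structurally identical; near 0 means unrelated shapes."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

existing = "def get_user(user_id):\n    return db.fetch('users', user_id)"
new_code = "def get_order(order_id):\n    return db.fetch('orders', order_id)"

print(structural_similarity(existing, new_code))  # 1.0 -- same shape, different names
```

Computing this ratio for each new function against the existing codebase, then tracking the fraction that score above a threshold (say 0.9), gives the kind of repetition measure described above.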
The "first draft" problem
AI is excellent at producing a first draft. It's less good at producing the right draft. In practice, this means developers spend less time writing code and more time editing, reviewing, and course-correcting code that an AI wrote.
This isn't captured in any commit-level metric. The commit lands looking clean. But the developer spent 20 minutes coaxing the AI toward the right solution, then another 10 minutes removing the unnecessary abstraction it added, then another 5 minutes fixing the edge case it missed. The commit history says "30 minutes, 80 lines." The reality was "35 minutes, and the developer could have written it in 25 without the AI, but in 45 without any prior knowledge of the library."
The AI helped. But not by the amount the metrics suggest.
Review burden shifts
This is the one we found most consistently across repositories. When AI generates more code, someone still has to review it. And reviewing AI-generated code is different from reviewing human-written code:
- The reviewer can't rely on familiarity with the author's style to quickly scan for intent.
- AI-generated code tends to be "correct-looking" — it follows conventions, uses proper variable names, handles obvious edge cases — which makes subtle bugs harder to spot. The code looks right even when it isn't.
- Larger PRs from AI-assisted development mean more surface area to review, but the same (or fewer) reviewers.
In several repositories we analysed, the introduction of AI tools coincided with a measurable increase in PR cycle time and a decrease in review thoroughness (measured by review comment density per line changed). The bottleneck didn't disappear. It moved from writing to reviewing.
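Review comment density is simple to compute once you have per-PR review data. A sketch with hypothetical numbers (real values would come from a forge API such as GitHub's pulls and reviews endpoints):

```python
# Hypothetical PR records; real data would come from the GitHub API
# (pull request, review, and review-comment endpoints).
def comment_density(prs):
    """Review comments per 100 lines changed, across a set of PRs."""
    lines = sum(p["lines_changed"] for p in prs)
    comments = sum(p["review_comments"] for p in prs)
    return 100 * comments / lines if lines else 0.0

pre_ai = [{"lines_changed": 120, "review_comments": 9},
          {"lines_changed": 80, "review_comments": 7}]
post_ai = [{"lines_changed": 400, "review_comments": 10},
           {"lines_changed": 350, "review_comments": 8}]

print(comment_density(pre_ai))   # 8.0 comments per 100 lines
print(comment_density(post_ai))  # 2.4 comments per 100 lines
```

A falling density with rising PR sizes, as in these made-up numbers, is the "same scrutiny spread over more code" pattern described above.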
Test coverage becomes more important (but doesn't always increase)
If you're generating code faster, your test infrastructure needs to keep up. In theory, AI also helps write tests. In practice, we see a common pattern: AI-assisted PRs ship more production code but proportionally less test code than the pre-AI baseline.
This isn't universal — some teams are disciplined about this. But the temptation is clear. The AI generates the feature code quickly, and the developer ships it while their enthusiasm is high, planning to "add tests later." The commit history is littered with these deferred promises.
The Metrics Nobody Wants to Talk About
Here's where it gets uncomfortable.
Knowledge erosion
When developers lean heavily on AI for code generation, they sometimes stop building deep familiarity with the codebase. We've observed repositories where contributor breadth appears to increase (more people touching more files) but contributor depth decreases (fewer people who deeply understand any given module).
This shows up as a subtle shift in the bus factor calculation. Traditionally, a bus factor of 1 means one person dominates a file or module. With AI-assisted development, you can have a bus factor of 3 where none of the three contributors could confidently explain the code without re-reading it. The metric looks healthy. The reality is fragile.
You can't measure this from commit data alone. But you can detect proxy signals: increased frequency of "refactor" commits shortly after AI-assisted feature commits (suggesting the initial code wasn't well-understood), higher revert rates, and more "fix" commits in the 48 hours following AI-heavy PRs.
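One of those proxy signals can be sketched directly: the fraction of feature commits that are followed by a fix or revert within 48 hours. The keyword matching here is a deliberately naive assumption; a real detector would be more careful about message conventions:

```python
from datetime import datetime, timedelta

# Naive heuristic for identifying follow-up repair commits.
FIX_PREFIXES = ("fix", "revert", "hotfix")

def quick_fix_rate(commits, window=timedelta(hours=48)):
    """Fraction of non-fix commits followed by a fix/revert commit within
    `window`. `commits` is a list of (timestamp, message) tuples."""
    features = [(t, m) for t, m in commits
                if not m.lower().startswith(FIX_PREFIXES)]
    fixes = [t for t, m in commits if m.lower().startswith(FIX_PREFIXES)]
    if not features:
        return 0.0
    followed = sum(1 for t, _ in features
                   if any(t < f <= t + window for f in fixes))
    return followed / len(features)

log = [
    (datetime(2024, 5, 1, 10), "add payment retry queue"),
    (datetime(2024, 5, 1, 15), "fix: retry queue drops messages on restart"),
    (datetime(2024, 5, 3, 9), "add invoice export"),
]
print(quick_fix_rate(log))  # 0.5 -- one of two feature commits needed a quick fix
```

Comparing this rate for AI-heavy periods against the baseline gives a rough, commit-data-only signal for the "initial code wasn't well-understood" pattern.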
Dependency on generation context
AI-generated code is only as good as the context window it had when generating. When that context is wrong or incomplete — which happens silently — the resulting code can be subtly misaligned with the project's actual architecture.
We've seen this manifest as a slow drift in code style and architectural patterns within a single repository. The codebase starts developing "dialects" — sections that use different patterns for the same operations, because different AI sessions had different context when generating them. Over months, this creates a maintenance burden that's invisible in any single commit but obvious when you zoom out.
The productivity paradox
Perhaps the most uncomfortable finding: in repositories with heavy AI tool adoption, we consistently see more code but not proportionally more capability. Features get built. But the ratio of code volume to user-facing functionality trends upward. The codebase grows faster than the product does.
This isn't unique to AI — it's the classic "accidental complexity" problem that software engineering has always struggled with. But AI accelerates it, because the cost of writing code drops toward zero while the cost of understanding, maintaining, and evolving that code stays constant.
So How Should You Actually Measure This?
After all of this, we think the honest answer is: you probably can't measure AI's impact with a single metric or a simple before/after comparison. But you can track a set of signals that, taken together, give you a reasonable picture:
Track what matters, not what's easy to count:
- PR cycle time (not PR volume)
- Review comment density (not approval speed)
- Revert and hotfix rates (not commit counts)
- Test-to-production code ratio over time
- Contributor depth per module (not just breadth)
- Commit message quality trends
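As a sketch of the first signal in this list, PR cycle time reduces to simple timestamp arithmetic once you have open and merge times (hypothetical data below; real timestamps would come from your forge's API):

```python
from datetime import datetime
from statistics import median

def cycle_time_hours(prs):
    """Median open-to-merge time in hours, ignoring still-open PRs.
    `prs` is a list of (opened_at, merged_at_or_None) tuples."""
    durations = [(merged - opened).total_seconds() / 3600
                 for opened, merged in prs if merged is not None]
    return median(durations) if durations else 0.0

prs = [
    (datetime(2024, 6, 1, 9), datetime(2024, 6, 1, 17)),  # merged in 8 hours
    (datetime(2024, 6, 2, 9), datetime(2024, 6, 3, 9)),   # merged in 24 hours
    (datetime(2024, 6, 4, 9), None),                      # still open
]
print(cycle_time_hours(prs))  # 16.0
```

The median matters here: a few long-lived PRs would drag a mean upward, while the median reflects the typical review experience.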
Watch for warning signs:
- Commit sizes trending upward without corresponding feature complexity
- Declining review thoroughness as code volume increases
- Growing gap between code output and test coverage
- Increasing frequency of "fix the thing we just shipped" commits
- Bus factor metrics that look healthy but mask shallow familiarity
Be honest about what you're measuring and why:
- If your goal is "prove AI is worth the license cost," you'll find the metrics to support that conclusion. That's confirmation bias, not measurement.
- If your goal is "understand how our development process is actually changing," you need to look at the uncomfortable metrics too.
The Bigger Picture
AI coding tools are genuinely useful. We use them ourselves. They reduce the friction of getting started, help with unfamiliar APIs, and handle boilerplate that nobody enjoys writing. The developers who use them effectively treat them as a drafting aid — not an autopilot — and maintain their own understanding of what's being generated.
But the industry's rush to quantify AI's impact has produced a lot of misleading numbers. "55% faster" doesn't mean 55% more productive. More code doesn't mean better code. And a metric that goes up isn't automatically a metric that matters.
The real impact of AI on a codebase is visible — but only if you're looking at the right signals, over the right timescale, with the right context. It shows up in how PR review patterns shift, how contributor knowledge distributes (or doesn't), how test coverage keeps pace (or doesn't), and how code complexity evolves relative to product complexity.
These are the kinds of signals we think about constantly at RepoShark. Our health scoring and risk detection were built to surface exactly these patterns — the slow shifts in repository health that are invisible in any single commit but obvious when you have the data to zoom out. Whether AI is part of your workflow or not, the question is the same: is this codebase getting healthier or quietly degrading? The answer is always in the patterns.
If you want to see what your repository's patterns actually look like — commit distribution, contributor depth, PR cycle times, risk signals — try a free analysis. No setup, no config. Just paste a repo URL.