You can't trust agent tests
Posted 9 April 2026 · 5 min read
I used an agent to migrate 53 Enzyme test suites to React Testing Library. The tests passed and the code looked coherent. But when we took a closer look, we found tests that wouldn't catch regressions, tests that asserted on the wrong things, and some that didn't really test anything at all.
With agent-generated tests this is easy to miss: the output looks intentional, the code is clean, and there's a lot of it.
Enzyme to RTL migration
The Enzyme tests were technical debt that had sat on the backlog as low priority for years. This is exactly the kind of well-defined, mechanical work agents are supposed to be good at. I chunked it by complexity, used my prepare-mr-skill to structure the commits and write descriptions, and it worked. At least it looked like it worked: the tests were migrated, they passed, and the code changes looked reasonable.
Before merging, we ran the MRs through our usual team review process and spotted tests that looked correct but wouldn't actually catch regressions. For example, one test asserted that an element with a specific ARIA role wasn't rendered; the role was invalid, so the component could never have rendered such an element in the first place. There was also confusion between disabled and aria-disabled: two attributes that behave differently and matter for accessibility, but look similar enough that a plausible-looking test can get them wrong without failing.
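The disabled versus aria-disabled confusion is worth making concrete. Here is a minimal sketch in plain TypeScript, with no real DOM or Testing Library involved; the Button type and click helper are invented for illustration. The point it models is real, though: aria-disabled only announces a disabled state to assistive technology, while the native disabled attribute actually blocks activation, so a test asserting the wrong one can pass while the behaviour is wrong.

```typescript
// Plain-TypeScript sketch, no real DOM or Testing Library: the Button
// shape and click() helper are invented to illustrate the semantics.
type Button = {
  disabled: boolean;      // native attribute: the browser blocks activation
  ariaDisabled: boolean;  // ARIA attribute: announced to assistive tech only
  onClick: () => void;
};

let clicks = 0;
const button: Button = {
  disabled: false,
  ariaDisabled: true, // looks disabled to a screen reader...
  onClick: () => { clicks += 1; },
};

// Roughly what a browser does on activation: only the native `disabled`
// attribute prevents the handler from running; `aria-disabled` does not.
function click(b: Button) {
  if (!b.disabled) b.onClick();
}

click(button);
console.log(clicks); // 1 — the aria-disabled button still handled the click
```

A test asserting `aria-disabled` when the behaviour depends on `disabled` (or vice versa) is exactly the kind of plausible-looking assertion that review caught.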
Why this happens
The problem isn't unique to agents; it's easy to write tests like this by hand too. But a human writing a bad assertion usually knows the component and might notice that the test feels too easy. An agent has no such intuition: it pattern-matches from the implementation to produce something that looks right and moves on. And the sheer volume of code agents can produce makes it easy for issues like this to slip through.
When the implementation already exists (e.g. migrating tests, backfilling coverage, validating tests written years ago), the natural safeguard of Test Driven Development (TDD) doesn't apply. You're writing tests against working code, so a bad assertion never gets the chance to fail. The only way to catch it is to deliberately introduce a failure: make a temporary change to the component, confirm the test catches it, then revert.
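That loop can be sketched with a toy implementation standing in for the component; everything here is illustrative, not the real migration code.

```typescript
// Break-and-confirm sketch: a toy "implementation" and the test that is
// supposed to guard it. All names are illustrative.
let greeting = "hello"; // stands in for the component's behaviour

// The assertion under suspicion: does it really pin the behaviour down?
const testPasses = () => greeting === "hello";

const baseline = testPasses();      // should pass against working code
greeting = "broken";                // 1. deliberately break the implementation
const duringBreak = testPasses();   // 2. confirm the test now fails
greeting = "hello";                 // 3. revert the temporary change

console.log({ baseline, duringBreak, reverted: testPasses() });
// A test where `duringBreak` comes back true isn't guarding anything.
```

If the test still passes while the implementation is broken, the assertion is decorative, and no amount of passing runs will tell you that.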
Getting an agent to actually do it
My first instinct was to fix the prompt: force the agent to check for these kinds of mistakes. I updated my RTL migration skill to explicitly require that every test be validated against a failing state.
It didn't work reliably. The agent would consider the requirement, sometimes catch something, and move on. It treated validation as a checkbox rather than a constraint.
Part of the problem is structural. When you ask an agent to migrate a large number of tests in a single pass, the task is long and the context fills up. As context grows, models start to drift from your instructions. Some models also display "context anxiety": they begin to wrap up work early as they approach the end of their context window. Both effects push agents to forget or skip validation instructions like these.
I tried running the migration per test suite rather than in one pass, thinking a shorter context would reduce the pressure to finish and give the agent more room to slow down on validation steps. The agent was more likely to check, but would often decide to validate "a few key tests" rather than every single one.
Finally, I split the migration and review steps into two separate agent tasks, using a bash script that looped through each test suite and ran two sequential agent prompts: one to migrate, one to review.
```bash
#!/bin/bash
echo "Starting migration..."

declare -a files=(
  "path/to/test-suite.test.tsx"
)

for i in "${files[@]}"; do
  echo "Migrating $i"
  agent -p --force --output-format stream-json --stream-partial-output --model claude-4.6-sonnet-medium \
    "Migrate $i to use RTL. Ensure linting passes and then commit your changes in a single commit."
  agent -p --force --output-format stream-json --stream-partial-output --model claude-4.6-sonnet-medium \
    "Review the RTL tests in $i -
    * Identify any unnecessary tests, remove them
    * Identify any tests not following best practices, fix them
    * For every single test intentionally change the implementation to break the specific feature tested and confirm the test fails as expected, fix them if they don't
    Ensure linting passes and then commit your changes in a single commit."
  echo "Migrated $i"
done
```

With a more focused validation task the agent created its own to-do list and worked through each test exhaustively, catching issues and making some improvements beyond what was caught in the review.
What it cost
That worked, but it also cost significantly more. All migrations were run through Cursor using claude-4.6-sonnet-medium on a batch of 22 test suites:
| Approach | Cost |
|---|---|
| Single-pass batch | ~$13 |
| Per-suite | ~$32 |
| Per-suite + reviewer | ~$62 |
The per-suite reviewer pass resolved every issue that human reviewers had previously caught, but at almost 5x the cost of the single-pass batch.
Agents enable us to do more, completing modernisation work that would otherwise have sat on a backlog indefinitely. But doing it well, with results we can trust, is more expensive than we might expect. If we're going to do this work we should do it properly, and that means accepting a real monetary cost even when the time required is greatly reduced.
Organisations need to be intentional about how this increased productivity turns into increased revenue, and not just increased costs on AI vendor bills.
The takeaway
The two-prompt script wasn't a better prompt; it was a different architecture. When you ask one agent to migrate and validate in the same pass, the task structure works against validation. When a process step consistently gets dropped, the fix usually isn't a better instruction. It's a workflow where that step can't be skipped.
What the reviewer is doing here is, conceptually, mutation testing: for each assertion, asking whether a change to the component would cause it to fail. Incorporating a mutation testing tool like Stryker into your pipeline could catch these issues automatically, and more cheaply than having an agent perform the mutations by hand.
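As a hand-rolled illustration of the idea (this is not Stryker's actual API; the mutants and tests here are invented), a weak assertion is one that no mutant can make fail:

```typescript
// Hand-rolled mutation-testing sketch: run each test against each mutant
// of the implementation and count how many mutants each test "kills".
type Impl = (a: number, b: number) => number;

// Hypothetical mutants of an addition function that a tool might generate.
const mutants: Record<string, Impl> = {
  "swap + for -": (a, b) => a - b,
  "always return 0": () => 0,
};

// One real assertion and one vacuous, plausible-looking one.
const tests: Record<string, (impl: Impl) => boolean> = {
  "adds two numbers": (impl) => impl(2, 3) === 5,
  "returns a number": (impl) => typeof impl(2, 3) === "number", // weak
};

for (const [name, test] of Object.entries(tests)) {
  const killed = Object.values(mutants).filter((m) => !test(m)).length;
  console.log(`${name}: killed ${killed}/${Object.keys(mutants).length} mutants`);
}
// "adds two numbers" kills both mutants; "returns a number" kills none,
// which flags it as an assertion that can't catch a regression.
```

A surviving mutant is exactly the signal the reviewer agent was producing manually: an assertion that stays green no matter how the implementation breaks.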
Time spent designing agent workflows is part of the work, and should be counted when evaluating productivity gains from using agents. It's the difference between output you can trust and output that merely looks trustworthy.