You can't trust agent tests
Posted 9 April 2026 · 5 min read
I used an agent to migrate 53 Enzyme test suites to React Testing Library. The tests passed and the code looked coherent. But when we took a closer look, we found tests that wouldn't catch regressions, tests that asserted on the wrong things, and some that didn't really test anything at all.
With agent-generated tests this is easy to miss: the output looks intentional, the code is clean, and there's a lot of it.
Enzyme to RTL migration
The Enzyme tests were technical debt that had sat on the backlog as low priority for years. This is exactly the kind of well-defined, mechanical work agents are supposed to be good at. I chunked it by complexity, used my prepare-mr-skill to structure the commits and write descriptions, and it worked. At least it looked like it worked: the tests were migrated, they passed, and the code changes looked reasonable.
Before merging, we ran the MRs through our usual team review process and spotted tests that looked correct but wouldn't actually catch regressions. For example, one test asserted that an element with a specific ARIA role wasn't rendered; the role was invalid, so the component could never have rendered such an element in the first place. There was also confusion between disabled and aria-disabled: two attributes that behave differently and matter for accessibility, but look similar enough that a plausible-looking test can get them wrong without failing.
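The disabled versus aria-disabled confusion is worth making concrete. Here is a minimal sketch in plain TypeScript, with no real DOM or Testing Library involved; the Button type and click helper are invented for illustration. The point it models is real, though: aria-disabled only announces a disabled state to assistive technology, while the native disabled attribute actually blocks activation, so a test asserting the wrong one can pass while the behaviour is wrong.

```typescript
// Plain-TypeScript sketch, no real DOM or Testing Library: the Button
// shape and click() helper are invented to illustrate the semantics.
type Button = {
  disabled: boolean;      // native attribute: the browser blocks activation
  ariaDisabled: boolean;  // ARIA attribute: announced to assistive tech only
  onClick: () => void;
};

let clicks = 0;
const button: Button = {
  disabled: false,
  ariaDisabled: true, // looks disabled to a screen reader...
  onClick: () => { clicks += 1; },
};

// Roughly what a browser does on activation: only the native `disabled`
// attribute prevents the handler from running; `aria-disabled` does not.
function click(b: Button) {
  if (!b.disabled) b.onClick();
}

click(button);
console.log(clicks); // 1 — the aria-disabled button still handled the click
```

A test asserting `aria-disabled` when the behaviour depends on `disabled` (or vice versa) is exactly the kind of plausible-looking assertion that review caught.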
Why this happens
The problem isn't unique to agents; it's easy to write tests like this by hand too. But a human writing a bad assertion usually knows the component and might notice that the test feels too easy. An agent has no such intuition: it pattern-matches from the implementation to produce something that looks right and moves on. And the sheer volume of code agents can produce makes it easy for issues like this to slip through.
When the implementation already exists (e.g. migrating tests, backfilling coverage, validating tests written years ago), the natural safeguard of Test Driven Development (TDD) doesn't apply. You're writing tests against working code, so a bad assertion never gets the chance to fail. The only way to catch it is to deliberately introduce a failure: make a temporary change to the component, confirm the test catches it, then revert.
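That loop can be sketched with a toy implementation standing in for the component; everything here is illustrative, not the real migration code.

```typescript
// Break-and-confirm sketch: a toy "implementation" and the test that is
// supposed to guard it. All names are illustrative.
let greeting = "hello"; // stands in for the component's behaviour

// The assertion under suspicion: does it really pin the behaviour down?
const testPasses = () => greeting === "hello";

const baseline = testPasses();      // should pass against working code
greeting = "broken";                // 1. deliberately break the implementation
const duringBreak = testPasses();   // 2. confirm the test now fails
greeting = "hello";                 // 3. revert the temporary change

console.log({ baseline, duringBreak, reverted: testPasses() });
// A test where `duringBreak` comes back true isn't guarding anything.
```

If the test still passes while the implementation is broken, the assertion is decorative, and no amount of passing runs will tell you that.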
Getting an agent to actually do it
My first instinct was to fix the prompt: force the agent to check for these kinds of mistakes. I updated my RTL migration skill to explicitly require that every test be validated against a failing state.
It didn't work reliably. The agent would consider the requirement, sometimes catch something, and move on. It treated validation as a checkbox rather than a constraint.
Part of the problem is structural. When you ask an agent to migrate a large number of tests in a single pass, the task is long and the context fills up. As context grows, models start to drift from your instructions. Some models also display "context anxiety": they begin to wrap up work early as they approach the end of their context window. Both effects push agents to forget or skip validation instructions like these.
I tried running the migration per test suite rather than in one pass, thinking a shorter context would reduce the pressure to finish and give the agent more room to slow down on validation steps. The agent was more likely to check, but would often decide to validate "a few key tests" rather than every single one.
Finally, I split the migration and review steps into two separate agent tasks, using a bash script that looped through each test suite and ran two sequential agent prompts: one to migrate, one to review.
```bash
#!/bin/bash
echo "Starting migration..."

declare -a files=(
  "path/to/test-suite.test.tsx"
)

for i in "${files[@]}"; do
  echo "Migrating $i"
  agent -p --force --output-format stream-json --stream-partial-output --model claude-4.6-sonnet-medium \
    "Migrate $i to use RTL. Ensure linting passes and then commit your changes in a single commit."
  agent -p --force --output-format stream-json --stream-partial-output --model claude-4.6-sonnet-medium \
    "Review the RTL tests in $i -
    * Identify any unnecessary tests, remove them
    * Identify any tests not following best practices, fix them
    * For every single test intentionally change the implementation to break the specific feature tested and confirm the test fails as expected, fix them if they don't
    Ensure linting passes and then commit your changes in a single commit."
  echo "Migrated $i"
done
```

With a more focused validation task the agent created its own to-do list and worked through each test exhaustively, catching issues and making some improvements beyond what was caught in the review.
What it cost
That worked, but it also cost significantly more. All migrations were run through Cursor using claude-4.6-sonnet-medium on a batch of 22 test suites:
| Approach | Cost |
|---|---|
| Single-pass batch | ~$13 |
| Per-suite | ~$32 |
| Per-suite + reviewer | ~$62 |
The per-suite reviewer pass resolved every issue that human reviewers had previously caught, but at almost 5x the cost of the single-pass batch.
Agents enable us to do more, completing modernisation work that would otherwise have sat on a backlog indefinitely. But doing it well, with results we can trust, is more expensive than we might expect. If we're going to do this work we should do it properly, and that means accepting a real monetary cost even when the time required is greatly reduced.
Organisations need to be intentional about how this increased productivity turns into increased revenue, and not just increased costs on AI vendor bills.
The takeaway
The two-prompt script wasn't a better prompt; it was a different architecture. When you ask one agent to migrate and validate in the same pass, the task structure works against validation. When a process step consistently gets dropped, the fix usually isn't a better instruction. It's a workflow where that step can't be skipped.
What the reviewer is doing here is, conceptually, mutation testing: for each assertion, asking whether a change to the component would cause it to fail. Incorporating a mutation testing tool like Stryker into your pipeline could catch these issues automatically, and more cheaply than having an agent perform the mutations by hand.
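As a hand-rolled illustration of the idea (this is not Stryker's actual API; the mutants and tests here are invented), a weak assertion is one that no mutant can make fail:

```typescript
// Hand-rolled mutation-testing sketch: run each test against each mutant
// of the implementation and count how many mutants each test "kills".
type Impl = (a: number, b: number) => number;

// Hypothetical mutants of an addition function that a tool might generate.
const mutants: Record<string, Impl> = {
  "swap + for -": (a, b) => a - b,
  "always return 0": () => 0,
};

// One real assertion and one vacuous, plausible-looking one.
const tests: Record<string, (impl: Impl) => boolean> = {
  "adds two numbers": (impl) => impl(2, 3) === 5,
  "returns a number": (impl) => typeof impl(2, 3) === "number", // weak
};

for (const [name, test] of Object.entries(tests)) {
  const killed = Object.values(mutants).filter((m) => !test(m)).length;
  console.log(`${name}: killed ${killed}/${Object.keys(mutants).length} mutants`);
}
// "adds two numbers" kills both mutants; "returns a number" kills none,
// which flags it as an assertion that can't catch a regression.
```

A surviving mutant is exactly the signal the reviewer agent was producing manually: an assertion that stays green no matter how the implementation breaks.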
Time spent designing agent workflows is part of the work, and should be counted when evaluating productivity gains from using agents. It's the difference between output you can trust and output that merely looks trustworthy.