Autonomous Test Code Generation That Actually Works
There’s a lot of noise out there about “AI for testing.”
But at Early, we’ve always focused on something more precise: AI for test code generation.
It’s a small wording difference that reflects a big shift in focus.
We don’t test software for you; we generate the test code itself, autonomously and at scale, with the rigor and structure a senior engineer would use, in a fraction of the time.
And here’s what that looks like in practice:
For the packages/common package of ts-morph, a large-scale OSS project, our Repository test generation agent created 1,876 green (passing) unit tests, achieved 76% coverage, and reached a mutation score of up to 91%. All fully autonomously, as part of a CI run.
Work that used to take a team months, even a full year, now completes in hours, at higher quality.
We set out to benchmark Early’s Repository Agent on a large-scale open source project: no toy examples, no cherry-picking.
The goal was simple:
Could our agent, running fully autonomously inside CI, generate high-quality, green (i.e. working) unit tests for a real-world repository?
No manual prompts. No human editing. Just autonomous test code generation at repo scale.
Before diving into the results, it's important to understand two key metrics we used: code coverage and mutation score. These metrics are crucial for evaluating test quality, and their interplay provides a comprehensive view of our tests’ effectiveness. Let's start with code coverage.
One of the most widely accepted methods for measuring test quality is code coverage. Code coverage measures what percentage of the code is exercised by tests. Zero coverage means the tests either don’t exist or are useless, leaving the code unprotected from future bugs and issues. As a quick example: if tests execute 3,750 of a package’s 4,937 lines, line coverage is 3,750 / 4,937 ≈ 76%.
However, code coverage is an insufficient measurement on its own. While low coverage is indicative of poor testing, high coverage does not necessarily indicate high quality. Even with 100% coverage, tests can be weak, for example if they don’t cover enough input cases, and the code can still harbor bugs and issues.
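Here is a minimal TypeScript sketch of that failure mode (the applyDiscount function and its test are hypothetical): a single assertion executes every line, so coverage reports 100%, yet the test cannot catch an obvious arithmetic bug.

```ts
import { expect } from "chai";

// Hypothetical function under test.
export function applyDiscount(price: number, percent: number): number {
  return price - price * (percent / 100);
}

describe("applyDiscount", () => {
  it("applies a discount", () => {
    // Every line of applyDiscount executes, so line coverage is 100%...
    expect(applyDiscount(100, 0)).to.equal(100);
    // ...but because percent is 0, this assertion would still pass if the
    // implementation mutated `-` to `+` or `/ 100` to `* 100`.
    // The test exercises the code without actually protecting it.
  });
});
```

Mutation testing, described next, is designed to expose exactly this kind of hollow coverage.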
Mutation testing is a software testing technique used to evaluate the quality and effectiveness of the tests themselves. The process involves introducing small changes or "mutations" to a program's source code to create modified versions of the program, known as "mutants." The primary objective of mutation testing is to assess whether the existing test suite can detect and fail these mutants, indicating that the test suite is thorough and robust.
In our benchmark, we used Stryker, a mutation testing framework for JavaScript, TypeScript and more.
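The exact Stryker settings we used aren’t reproduced here, but a minimal configuration sketch (assuming a Mocha test runner and the package layout described later in this post) looks like this:

```json
{
  "$schema": "./node_modules/@stryker-mutator/core/schema/stryker-schema.json",
  "packageManager": "npm",
  "testRunner": "mocha",
  "coverageAnalysis": "perTest",
  "mutate": ["packages/common/src/**/*.ts"],
  "reporters": ["clear-text", "html", "progress"]
}
```

With @stryker-mutator/core and the matching test-runner plugin installed as dev dependencies, `npx stryker run` generates the mutants, runs the suite against each one, and reports the mutation score.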
How mutation testing works:
1. Generating mutants: The first step in mutation testing is to create multiple versions of the original program, each with a slight modification. These modified versions are known as mutants (a concrete sketch follows this list). Common types of mutations include:
a. Changing a logical operator (e.g., replacing && with ||).
b. Modifying a mathematical operator (e.g., replacing + with -).
c. Altering a constant value.
d. Changing a conditional boundary.
2. Running tests on mutants: Each mutant is tested using the existing test suite. The purpose is to check whether the tests can detect the changes (i.e., cause the tests to fail).
3. Analyzing results: After running the tests, the outcomes are analyzed:
a. Killed mutants: If a test fails due to the mutation, the mutant is considered "killed," indicating that the test suite is effective in detecting that type of fault.
b. Survived mutants: If the tests pass despite the mutation, the mutant is considered "survived," indicating that the test suite did not detect the fault.
c. There are other types of “not killed” mutants, like no coverage, timeouts, and errors.
4. Calculating the mutation score: The mutation score is calculated using the formula:

Mutation Score = (Killed Mutants / Total Mutants) × 100%
This score provides a quantitative measure of the effectiveness of the test suite.
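To make these steps concrete, here is a minimal TypeScript sketch (the isEligible function and its tests are hypothetical) showing two typical mutants and the tests that kill them:

```ts
import { expect } from "chai";

// Hypothetical function under test.
export function isEligible(age: number, hasLicense: boolean): boolean {
  return age >= 18 && hasLicense;
}

// Mutants a tool like Stryker might generate:
//   M1 (conditional boundary): age >= 18  ->  age > 18
//   M2 (logical operator):     &&         ->  ||

describe("isEligible", () => {
  it("accepts exactly at the age boundary", () => {
    // Kills M1: under that mutant, isEligible(18, true) returns false.
    expect(isEligible(18, true)).to.equal(true);
  });
  it("requires a license", () => {
    // Kills M2: under that mutant, isEligible(18, false) returns true.
    expect(isEligible(18, false)).to.equal(false);
  });
});
```

If a suite kills both mutants, its mutation score for this function is 2 / 2 = 100%; if one survived, the score would drop to 50%.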
The relationship between code coverage and mutation score
Complementary Metrics:
- Code coverage and mutation score complement each other. Code coverage ensures that the tests exercise the code, while mutation score ensures that the tests are effective in detecting faults within the exercised code.
Correlation:
- Generally, higher code coverage can lead to a higher mutation score, as more parts of the code are being tested. However, this is not always the case. It's possible to have high code coverage but a low mutation score if the tests are not thorough or effective in catching faults.
Quality assessment:
- A balanced approach using both metrics provides a more comprehensive assessment of test suite quality. High code coverage with a high mutation score indicates that the tests are both extensive and effective. Conversely, high coverage with a low mutation score suggests that the tests need improvement in fault detection, likely by adding more edge-case tests.
Optimization feedback:
- Code coverage can highlight areas of the code that need more tests, while mutation score can highlight the need for more robust and fault-detecting tests in already covered areas.
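As an illustration of that feedback loop (hypothetical code), suppose mutation testing reports that a conditional-boundary mutant survived in an already-covered function; one targeted test kills it:

```ts
import { expect } from "chai";

// Covered function in which the boundary mutant
// (items.length > 0  ->  items.length >= 0) survived.
export function average(items: number[]): number {
  return items.length > 0
    ? items.reduce((sum, n) => sum + n, 0) / items.length
    : 0;
}

describe("average", () => {
  // Existing test: exercises only the happy path, so the mutant survives
  // (both `> 0` and `>= 0` are true for a two-element array).
  it("averages the values", () => {
    expect(average([2, 4])).to.equal(3);
  });
  // Added test: pins the empty-array boundary. Under the mutant the
  // function returns 0 / 0 = NaN, so this assertion fails and kills it.
  it("returns 0 for an empty array", () => {
    expect(average([])).to.equal(0);
  });
});
```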
Okay, let’s get to the results.
To conduct our benchmark, we used a popular OSS project, ts-morph, and the latest Early Repository Agent for test code generation.
Test Project: ts-morph
ts-morph is an open-source project that provides a powerful and user-friendly API for working with the TypeScript Compiler API. It is designed to simplify the process of interacting with TypeScript code, enabling developers to create, manipulate and analyze TypeScript code programmatically.
- GitHub Stars: 5,800
- GitHub Forks: 224
- Language: TypeScript
- Repository: https://github.com/dsherret/ts-morph
- Clone date: June 2024
- Tested package: packages/common/src
- Lines of application code (packages/common/src): 4,937 LoC
Setup:
- The original tests were removed from the project’s code to mimic a clean slate. This also allows evaluating the generated unit tests in isolation from other forms of tests.
- Tests were generated only for packages/common/src.
- We used the Early Agent for repository test code generation.
- The agent attempted to generate unit tests for the 211 public methods in this package.
- We tracked three key metrics:
  - Test code coverage: how much of the code was exercised.
  - Mutation score: how effective the tests were at catching bugs. We calculated mutation scores for three scenarios: all methods; methods with generated unit tests (any coverage); and methods with generated unit tests and 100% coverage.
  - Total number of working, green tests: the usable output of test code generation.
As coverage increased, test quality (mutation score) increased as well.
For most methods, 100% coverage resulted in a 100% mutation score, meaning the generated tests detected every injected mutant (i.e., every simulated bug).

Now, let’s break down the mutation score for three groups:
- The entire project - all files: Mutation score is 53%
- Only for methods where Early generated unit tests at any coverage (at least one unit test): Mutation score is 86%
- Only for methods where Early generated unit tests with 100% coverage: Mutation score is 91%

Most AI testing tools stop at coverage. But coverage only tells you how much code is covered, not how well it was tested.
We use mutation testing to go deeper.
It intentionally introduces small code changes (“mutants”), such as flipping conditions or tweaking operators, and checks whether the tests detect the change.
If a test fails when a mutant is introduced, that’s a good thing: it means the test is actually protecting against real bugs.
So while 76% coverage is strong, what’s more important is that our mutation score climbed from 53% overall to 91% for methods that have 100% coverage.
That’s not just test quantity - that’s test quality.
The graph below shows a more accurate mutation score calculation, restricted to methods above a given coverage threshold.
Out of the 211 methods on this project:
- 185 methods had unit tests (i.e. any coverage) and an 86% mutation score
- 150 methods had unit tests with 100% coverage and a 91% mutation score

Further analysis of the 150 methods with 100% coverage reveals that most of them have a mutation score of 80% or above, and about three quarters (114 methods) have a perfect mutation score of 100%.

The Repository Agent acts like an autonomous test engineer living inside your CI. It executes the following steps:
- Analyzes your repository to identify testable methods and dependencies.
- Plans test coverage intelligently, prioritizing critical or uncovered areas.
- Generates test code using contextual understanding of your repo.
- Runs and validates those tests, automatically fixing any failing ones.
- Measures coverage and mutation scores to verify the generated tests’ quality.
And because it’s embedded directly into CI, the entire process runs automatically and periodically for every repository, pull request, or scheduled build, protecting your code continuously.
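In pseudocode terms, the loop looks roughly like the sketch below. This is a conceptual illustration, not Early’s actual implementation; every name in it is hypothetical, and the stages are stubbed out with `declare` so the sketch type-checks without committing to real APIs.

```ts
type Method = { name: string; file: string };
type RunResult = { passed: boolean; failures: string[] };

// Stubbed stages; a real agent backs these with static analysis,
// an LLM, a test runner, and coverage/mutation tooling.
declare function analyzeRepository(repo: string): Promise<Method[]>;
declare function prioritize(methods: Method[]): Method[];
declare function generateTest(method: Method): Promise<string>;
declare function runTest(test: string): Promise<RunResult>;
declare function fixTest(test: string, failures: string[]): Promise<string>;
declare function commitTest(test: string): Promise<void>;
declare function measureQuality(repo: string): Promise<{ coverage: number; mutationScore: number }>;

const MAX_FIX_ATTEMPTS = 3;

export async function runAgent(repo: string): Promise<void> {
  const methods = await analyzeRepository(repo);       // analyze the repo
  for (const method of prioritize(methods)) {          // plan coverage
    let test = await generateTest(method);             // generate test code
    let result = await runTest(test);                  // run and validate
    for (let i = 0; i < MAX_FIX_ATTEMPTS && !result.passed; i++) {
      test = await fixTest(test, result.failures);     // auto-fix failures
      result = await runTest(test);
    }
    if (result.passed) await commitTest(test);         // keep only green tests
  }
  const { coverage, mutationScore } = await measureQuality(repo); // verify quality
  console.log(`coverage=${coverage}%, mutation score=${mutationScore}%`);
}
```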
What We Learned
Running this benchmark at scale gave us several valuable insights:
- Coverage alone isn’t enough. The correlation between coverage and mutation score proved that test quality must be measured, not assumed.
- Autonomous generation works. The agent produced thousands of green tests without human intervention, showing that continuous test code generation in CI is not just possible, but reliable.
Why This Matters
For engineering leaders, it means consistent, measurable quality that scales with your organization and standardizes how tests are created.
For developers, this means more time building, less time maintaining brittle tests.
AI test code generation is no longer a demo feature; it’s an autonomous system that can continuously improve the quality and reliability of your codebase.
This benchmark proves it.
Early’s Repository Agent generated:
- 1,876 working tests
- 76% coverage
- 91% mutation score for methods with 100% coverage
- Evidence that high coverage from Early-generated tests translates into high test quality
- All autonomously, directly from CI.
That’s what real AI test code generation looks like.
We’re expanding this benchmark to more frameworks (React, JS, Python), deeper mutation analysis, and long-term stability testing.
If you’d like to see what the Repository Agent can do on your own repo, check our repo-cli-introduction documentation and book a demo here.
And if you want to check out the tests and the results yourself, we’ve made them public: you can explore everything in this public Early clone of ts-morph, with all the generated tests under packages/common/src, on GitHub.