Shashank Bindal

Why We Built a Verification Engine Instead of Just Generating Tests

Any tool can write a test that runs. We needed tests that prove something. Here's the engineering decision behind Quell's two-phase verification and why it makes all the difference.


Generating test code is not a hard problem. An LLM does it in seconds. A template engine does it in milliseconds. The hard part — the one nobody talks about — is whether the generated test actually proves anything.

When we started building Quell, the tempting path was to just generate tests and ship them. Parse the docstring, construct a test, write it to disk. Fast, impressive-looking output, done. A few thousand lines of code and you have something that looks like a test generator.

We nearly did that. Then we ran some of those generated tests through mutation testing.

Around 20% of them were useless. They ran the code, they asserted something, they showed up in coverage — and when we injected violations into the source, they still passed. The test had no idea the requirement was broken. It was providing confidence it hadn't earned.

That's when we decided to build the verification engine, even though it would triple our development time.

The actual problem with generated tests

Here's what typically comes out of an LLM or template-based test generator for a payment function:

def test_process_payment():
    result = process_payment(10.0, "USD")
    assert result is not None
    assert "status" in result

This test is fine as far as it goes. It runs the happy path, checks two things, doesn't fail. But the docstring on that function says "raises ValueError if amount is zero or negative." This test never calls process_payment with a bad amount. It doesn't touch that code path.

Coverage tools mark process_payment as covered. The requirement is silently unverified. Nobody gets paged — until a customer finds the bug.
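For contrast, here's the kind of test that does exercise that promise. A minimal sketch, assuming process_payment can be imported from a module called payments (the import path is hypothetical):

import pytest

from payments import process_payment  # hypothetical import path

def test_process_payment_rejects_zero_amount():
    # The docstring promises ValueError for a zero or negative amount.
    # Remove that guard and this test fails, which is exactly the signal
    # a verifier needs to see.
    with pytest.raises(ValueError):
        process_payment(0.0, "USD")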

What "verified" means

A test proves a requirement if and only if it would fail when that requirement is violated. There's no other definition.

You can't know this from reading the test. You have to actually try it: remove the guard, run the test, see what happens. That experiment is what Quell's verification engine automates.

Two phases:

Phase 1: Run the generated test on the original unmodified source. If it fails for any reason — wrong exception type, bad arguments, syntax error — the test is discarded immediately. No point proceeding.

Phase 2: Inject a violation into the source. For a MUST_RAISE requirement, that means commenting out the raise statement. For a BOUNDARY requirement, it means weakening the Field validator. For MUST_RETURN, replacing the return value with None. Then run the test again. If it still passes, the test doesn't prove the requirement. Also discarded.

Only tests that pass phase 1 and fail phase 2 get written to disk.
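Concretely, the gate per candidate looks something like this sketch. _run_pytest matches the helper quoted in the next section; _fails_under_violation is a hypothetical name for the phase 2 step, not Quell's actual internals:

from pathlib import Path

def _verify(test_file: Path, source_path: Path, requirement) -> bool:
    # Phase 1: the candidate must pass against the untouched source.
    if _run_pytest(test_file).returncode != 0:
        return False
    # Phase 2: it must fail once the requirement is violated.
    # _fails_under_violation injects the violation, re-runs pytest,
    # and restores the source; its core is quoted below.
    return _fails_under_violation(test_file, source_path, requirement)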

The engineering bit that was actually hard

The violation injection sounds simple until you try to implement it on real code.

The naive version — regex-replace the raise statement — breaks constantly. Two functions in the same file with similar patterns. A guard inside a loop inside a conditional. A raise that's three levels of indentation deep. We spent about a week on this before switching to AST-based line range targeting.

The key insight: each requirement is tied to a specific function and line range, extracted during the scan phase. When injecting a violation, Quell modifies only the AST nodes within that range and leaves everything else alone. A sloppy injection that breaks unrelated code would cause phase 1 to fail for every test, making verification worthless.
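Here's a minimal sketch of that targeting for a MUST_RAISE violation, using Python's ast module. The function shape and parameter names are illustrative, not Quell's actual code:

import ast

def inject_must_raise_violation(source: str, start_line: int, end_line: int) -> str:
    """Replace raise statements inside the requirement's line range with pass."""
    tree = ast.parse(source)

    class _StripRaise(ast.NodeTransformer):
        def visit_Raise(self, node: ast.Raise) -> ast.AST:
            # Only touch raises inside the target range; the rest of the
            # file, including similar-looking functions, is left alone.
            if start_line <= node.lineno <= end_line:
                return ast.copy_location(ast.Pass(), node)
            return node

    new_tree = ast.fix_missing_locations(_StripRaise().visit(tree))
    return ast.unparse(new_tree)  # ast.unparse needs Python 3.9+

Because the transformer checks line numbers, a second function elsewhere in the file with an identical raise pattern is never modified, which is exactly what the regex approach kept getting wrong.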

The finally block is non-negotiable:

try:
    _inject_violation(source_path, requirement)
    result = _run_pytest(test_file)
    return result.returncode != 0
finally:
    source_path.write_text(original_source)

No matter what happens — test crash, pytest segfault, keyboard interrupt — the original source comes back. This is the invariant we've never broken. It's also why the verifier runs pytest in a subprocess rather than in-process: if the module is already imported, Python's module cache still holds the original code, the violated source on disk never actually executes, and phase 2 produces false negatives.
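The out-of-process run itself is just a subprocess call. A minimal sketch of the kind of helper the snippet above assumes; the exact pytest flags are an assumption, not necessarily what Quell passes:

import subprocess
import sys
from pathlib import Path

def _run_pytest(test_file: Path) -> subprocess.CompletedProcess:
    # A fresh interpreter per run means no stale module cache: whatever
    # is on disk, violated or original, is what actually gets imported.
    return subprocess.run(
        [sys.executable, "-m", "pytest", str(test_file), "-x", "-q"],
        capture_output=True,
        text=True,
    )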

What we considered and rejected

We thought about running mutmut or Stryker after generation. The problem: mutation testing runs every mutant against your entire test suite. It's measuring something different — how many of your existing tests catch each injected bug. We needed to know whether a specific generated test caught its specific violation. That's a targeted experiment, not a full mutation run. And mutation testing on a real project takes 15-30 minutes. That's not compatible with a generation step.

We also considered a --verify flag that's off by default. Some tools do this. We decided against it pretty quickly: if you're writing tests to disk, they should be verified. Unverified tests that look authoritative are exactly the problem we're trying to solve. Making verification opt-in would mean most users never opt in, and we'd be shipping the same garbage as everyone else.

The honest tradeoff

Verification is slower. A rule engine generates test code in under 10ms. Two subprocess pytest runs add 2-5 seconds per test. On a file with 20 uncovered requirements, you're looking at a minute of verification time.

We think that's the right call. Tests are written once and run forever. A 60-second process that produces 20 tests you can trust is worth more than a 2-second process that produces 20 tests of unknown quality — some of which are actively misleading.

The data from running this on real projects: about 15-20% of generated tests fail phase 2. They look correct, they run green, but they don't catch the violation they're supposed to catch. Without verification, every one of those would have landed in your test suite providing false confidence.

That number is what made the engineering cost worth it.


Quell is open source. Read the verifier code on GitHub or try it on your own project.

Try Quelltest

Install Quelltest and run it on your codebase — no API key, no configuration.