100% Code Coverage Is a Lie — Here's What Actually Matters
A while back we were sitting at 94% coverage and felt pretty good about it. Then a customer reported that our payment function was happily accepting zero-amount charges and doing nothing. The tests had run that exact line hundreds of times. They just never bothered to check what happened when you passed amount=0.
That broke my faith in coverage as a signal of quality. Not because coverage is useless — it's not — but because we'd been treating it as proof of correctness when it's really just proof that code executed.
What the badge actually tells you
Coverage tools have one job: answer whether a given line ran during your test suite. That's genuinely useful for finding dead code, spotting untested branches, and making sure you didn't forget to test an entire module.
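If you're measuring this in Python, pytest-cov is the usual tool (assuming your code lives under src/):

pip install pytest-cov
pytest --cov=src --cov-report=term-missing

The term-missing report lists the line numbers that never ran, which is exactly the dead-code and forgotten-module signal coverage is good at.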
What they can't tell you: whether the test that ran the line checked anything meaningful. The following test achieves 100% line coverage on a payment function:
def test_payment_runs():
    result = process_payment(amount=0, currency="USD")
    assert result is not None
It ran every line. It checked that something came back. It would pass even if the function accepted negative amounts, returned garbage, or silently failed to charge anyone. The coverage badge goes up. The bug stays.
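To make that concrete, here's a deliberately broken implementation that still satisfies the test above:

def process_payment(amount: float, currency: str) -> dict:
    # Accepts amount=0, amount=-50, currency="???" -- anything.
    # Never talks to a payment provider. Still returns "not None".
    return {"status": "ok"}

One line, 100% covered, test green, completely wrong.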
The thing coverage misses
Every function has a contract — explicit or not. Docstrings, Pydantic models, and type annotations are basically that contract written down:
def process_payment(amount: float, currency: str) -> dict:
    """
    Raises:
        ValueError: If amount is <= 0.
        ValueError: If currency is not in SUPPORTED_CURRENCIES.

    Returns:
        dict with keys: transaction_id, status, timestamp.
    """
Between the docstring and the model, that's four testable requirements right there. A typical 100%-coverage test suite might satisfy zero of them and still report green. That's not a coverage failure; the lines ran. It's a requirement coverage failure. The tests never tried to prove those behaviors exist.
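For contrast, here is what tests that actually pin those requirements might look like (a sketch; the payments module name and the invalid inputs are assumptions):

import pytest
from pydantic import ValidationError
from payments import PaymentRequest, process_payment

def test_zero_amount_raises():
    with pytest.raises(ValueError):
        process_payment(amount=0, currency="USD")

def test_unsupported_currency_raises():
    with pytest.raises(ValueError):
        process_payment(amount=10.0, currency="XYZ")

def test_result_has_documented_keys():
    result = process_payment(amount=10.0, currency="USD")
    assert {"transaction_id", "status", "timestamp"} <= result.keys()

def test_currency_enum_enforced():
    with pytest.raises(ValidationError):
        PaymentRequest(amount=10.0, currency="XYZ")

Four requirements, four tests. Delete the amount guard and the first test goes red, which is exactly the property the vacuous test above never had.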
This distinction matters a lot in payment code, auth code, data pipelines — anywhere a silent wrong answer is worse than a loud crash.
What "proven" actually means
Here's a useful test: take a specific requirement — say, "raises ValueError when amount is zero" — and ask yourself: if someone deleted that guard tomorrow, would any test catch it?
If the answer is no, you don't have a test for that requirement. You have a test that happens to pass while the requirement exists.
The way to make this rigorous is to actually try it. Temporarily remove the guard, run your tests, see what breaks. If nothing breaks, the tests don't prove the requirement.
That's what mutation testing is: injecting deliberate violations and measuring how many tests catch them. It's the honest version of "are my tests any good?" Most teams don't run it because it's slow and hard to interpret. But the question it answers is the right one.
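You can hand-roll a crude version of that experiment. This sketch assumes the guard lives in payments.py with the exact message shown; real tools like mutmut automate the same loop across every operator and branch:

import subprocess
from pathlib import Path

SOURCE = Path("payments.py")  # hypothetical location of the guard
GUARD = 'raise ValueError("amount must be positive")'  # hypothetical exact text

original = SOURCE.read_text()
try:
    # Simulate a developer deleting the guard.
    mutated = original.replace(GUARD, "pass  # guard deleted")
    assert mutated != original, "guard text not found in payments.py"
    SOURCE.write_text(mutated)

    # Rerun the suite. A zero exit code means every test still passed.
    result = subprocess.run(["pytest", "-q"])
    if result.returncode == 0:
        print("Suite is green with the guard gone: requirement unproven.")
    else:
        print("Suite went red: a test actually proves the guard.")
finally:
    SOURCE.write_text(original)  # always restore the real code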
How Quell approaches this
Quell reads your docstrings and Pydantic models, pulls out each stated requirement, and checks whether any test would actually fail if that requirement were violated. For anything uncovered, it generates a test, runs it on the original code (must pass), then injects a violation and runs it again (must fail). Only tests that survive both rounds get written to disk.
quell check src/ --no-llm
process_payment MUST_RAISE ValueError: amount <= 0 ✗ no test
process_payment MUST_RAISE ValueError: bad currency ✗ no test
PaymentRequest ENUM_VALID currency: USD|EUR|GBP ✗ no test
No annotation overhead. It reads what you've already written.
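The two-round check is the load-bearing part, and it is easy to state precisely. This is an illustrative sketch of the acceptance rule described above, not Quell's actual internals; run_test stands in for however you execute a candidate test against a given version of the code:

def survives_both_rounds(run_test, original_src, violated_src):
    """run_test(src) -> bool: does the candidate test pass against src?"""
    if not run_test(original_src):
        return False   # round 1: must pass on the real code
    if run_test(violated_src):
        return False   # round 2: must fail once the requirement is broken
    return True        # only now is the test worth writing to disk

# A vacuous test that passes no matter what is rejected in round 2:
always_green = lambda src: True
assert not survives_both_rounds(always_green, "real", "violated")

Round 2 is what filters out exactly the kind of assert-not-None test this post started with.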
Actually fixing it
Coverage isn't wrong to track — it's just not the finish line. A low coverage number is bad. A high one just means you ran the code. Neither tells you whether the code is correct.
What's worth adding alongside it: a check that each documented requirement has a test that would catch a violation. quell check src/ --no-llm gives you that in about 10 seconds on most codebases.
The bug we shipped? It would've been caught immediately. The docstring said "raises ValueError if amount is zero." Quell would've flagged it as uncovered and, with --fix, generated and verified a test for it. We just didn't have that check in place yet.
Install Quell and see what your docstrings are missing. No API key, no config.