Shashank Bindal · 4 min read

Your Docstrings Are Already a Test Spec — You're Just Not Using Them

Python docstrings contain every testable requirement your functions have. Learn how to turn Raises:, Returns:, and Args: blocks into verified pytest tests automatically.

I've written a lot of docstrings. For a long time, I wrote them for two audiences: the colleague who'd read the source, and Sphinx. It took an embarrassingly long time to notice they were also a complete test specification — one I was ignoring.

Look at a typical well-written docstring:

def create_user(email: str, role: str = "viewer") -> dict:
    """
    Create a new user account.

    Args:
        email: Valid email address for the new account.
        role: Account role. Must be one of: admin, editor, viewer.

    Raises:
        ValueError: If email format is invalid.
        ValueError: If role is not one of the allowed values.
        DuplicateEmailError: If an account with this email already exists.

    Returns:
        dict with keys: user_id, email, role, created_at.
    """

Count what's in there: five distinct, testable claims about how this function behaves. If you asked someone to write tests for this function, those five claims would be the spec. You already wrote the spec. You just wrote it in a docstring instead of a test file.

The translation gap

The docstring says "raises ValueError for invalid email." Most test suites have something like this:

def test_create_user():
    user = create_user("alice@example.com")
    assert user["email"] == "alice@example.com"

That runs the happy path. It doesn't touch the error cases at all. If someone deleted the email validation tomorrow, this test would still pass. The docstring and the test suite are describing completely different things, and nobody's checking whether they agree.

This is the gap. Not missing coverage in the line-count sense — the function is probably "covered." Missing requirement coverage: none of the documented behaviors actually have a test that would catch a regression.

Writing tests that prove requirements

A test that proves a requirement has one property: it would fail if the requirement were violated. For the ValueError on invalid email, that's:

import pytest

def test_create_user_raises_on_invalid_email():
    with pytest.raises(ValueError):
        create_user("not-an-email")

Remove the email validation guard and this test fails. That's what "proven" means. The first style of test — the happy path — doesn't prove anything about error handling. This one does.

Same pattern for the role check:

def test_create_user_raises_on_invalid_role():
    with pytest.raises(ValueError):
        create_user("alice@example.com", role="superadmin")

Simple, targeted, directly tied to something in the docstring.

Reading requirements automatically

Docstrings are structured enough that they're machine-readable. The Raises:, Returns:, and Args: sections follow consistent patterns whether you're writing Google style, NumPy style, or Sphinx style. Each entry maps to a typed requirement:

  • Raises: ValueError: If... → MUST_RAISE ValueError
  • Returns: dict with keys: X, Y → MUST_RETURN dict
  • Args: amount: Must be positive → BOUNDARY
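
Here's roughly what that extraction step looks like. This is a minimal sketch, not Quell's internals: it walks the AST, reads each function's docstring, and pulls entries out of the Raises: section. The regexes and the tuple shape are illustrative assumptions.

import ast
import re

SECTION_HEADER = re.compile(r"^\s*(\w+):\s*$")    # e.g. "Raises:" or "Returns:"
RAISES_ENTRY = re.compile(r"^\s*(\w+):\s*(.+)$")  # e.g. "ValueError: If ..."

def extract_raises(source: str) -> list[tuple[str, str, str]]:
    """Return (function, exception, condition) for every Raises: entry."""
    requirements = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        section = None
        for line in (ast.get_docstring(node) or "").splitlines():
            header = SECTION_HEADER.match(line)
            if header:
                section = header.group(1)  # track which section we're in
            elif section == "Raises":
                entry = RAISES_ENTRY.match(line)
                if entry:
                    requirements.append((node.name, *entry.groups()))
    return requirements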

Quell parses these sections straight from your code's AST — no LLM, no network call — and tells you which ones have no test:

quell check src/users.py --no-llm
  create_user   MUST_RAISE   ValueError: invalid email    ✗ no test
  create_user   MUST_RAISE   ValueError: invalid role     ✗ no test
  create_user   MUST_RAISE   DuplicateEmailError          ✗ no test
  create_user   MUST_RETURN  dict with user_id key        ✓ covered

That's one requirement covered out of four — 25% requirement coverage, not line coverage. The same function might show up as 90% covered in pytest-cov.

Generating the missing tests

With --fix, Quell generates a pytest test for each gap, then runs a two-phase check:

  1. Run on the original source → must pass
  2. Inject a violation (comment out the raise, weaken the boundary) → must fail

Only tests that pass both rounds get written to disk. It's not just generation — it's generation plus proof.
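
The check itself is conceptually simple. Here's a minimal sketch of the idea — not Quell's actual implementation, and the crude comment-out-the-raise mutation is just one way to inject a violation:

import subprocess
from pathlib import Path

def pytest_passes(test_path: str) -> bool:
    """True if pytest exits cleanly for the given test file."""
    return subprocess.run(["pytest", "-q", test_path]).returncode == 0

def verify_generated_test(test_path: str, source_path: str) -> bool:
    """Keep a test only if it passes on real code and fails on broken code."""
    if not pytest_passes(test_path):         # phase 1: must pass as-is
        return False
    source = Path(source_path)
    original = source.read_text()
    # phase 2: inject a violation by disabling raise statements (crude)
    source.write_text(original.replace("raise ", "pass  # raise "))
    try:
        return not pytest_passes(test_path)  # a good test fails now
    finally:
        source.write_text(original)          # always restore the source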

quell check src/users.py --fix --no-llm
  create_user   MUST_RAISE   ValueError: invalid email   ✓ verified → written
  create_user   MUST_RAISE   ValueError: invalid role    ✓ verified → written
  create_user   MUST_RAISE   DuplicateEmailError         ✗ skipped (external dependency)

  2 tests written → tests/test_users.py
  Your code never left your machine.

The DuplicateEmailError case gets skipped — checking for a duplicate requires a database state Quell can't set up deterministically. That's the honest answer, not a hallucinated test.
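
That case is yours to write, using whatever test infrastructure your project already has. Something like this, assuming a hypothetical fixture that gives each test a clean database:

def test_create_user_raises_on_duplicate_email(db):  # `db` fixture assumed
    create_user("alice@example.com")
    with pytest.raises(DuplicateEmailError):
        create_user("alice@example.com")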

Pydantic models work too

If you're using Pydantic, the model itself is already a requirement specification — no docstring needed:

from typing import Literal
from pydantic import BaseModel, Field

class PaymentRequest(BaseModel):
    amount: float = Field(gt=0, le=10_000)
    currency: Literal["USD", "EUR", "GBP", "INR"]

Quell reads the Field validators and Literal annotations directly. Two boundary tests, one enum test, all auto-generated and verified. The model is the spec.
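
The generated tests would look something like this — illustrative of the pattern, not Quell's literal output:

import pytest
from pydantic import ValidationError

def test_amount_must_be_positive():
    with pytest.raises(ValidationError):
        PaymentRequest(amount=0, currency="USD")       # violates gt=0

def test_amount_upper_bound():
    with pytest.raises(ValidationError):
        PaymentRequest(amount=10_001, currency="USD")  # violates le=10_000

def test_currency_must_be_allowed():
    with pytest.raises(ValidationError):
        PaymentRequest(amount=50, currency="JPY")      # not in the Literal set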

Actually trying it

If you have any Python file with docstrings, this takes about two minutes:

pip install quelltest
quell check src/ --no-llm

The output shows you which documented requirements have no test. Most codebases have more gaps than expected — not because the tests are bad, but because the happy-path tests and the error-path requirements were written independently and never reconciled.

Your docstrings are already there. Start using them.


Full docs — or just pip install quelltest and run it on whatever's in front of you.

Try Quelltest

Install Quelltest and run it on your codebase — no API key, no configuration.