How to Write Python Docstrings That Double as a Test Specification

Docstrings and tests describe the same thing. One does it in prose, the other in code. If they diverge — the docstring says "raises ValueError for invalid input" but no test verifies it — you have a contract that nobody's checking.

The good news: docstrings are structured enough to be machine-readable. If you write them consistently, tools like Quell can pull requirements out of them automatically and tell you which ones have no test.

This is how to write docstrings that work for both.

The two formats Quell reads

Both Google style and NumPy style work. Pick whichever your team already uses.

Google style:

def create_order(user_id: int, amount: float, currency: str) -> dict:
    """
    Create a new order for a user.

    Args:
        user_id: ID of the user placing the order.
        amount: Order total in the given currency. Must be positive.
        currency: ISO 4217 currency code.

    Raises:
        ValueError: If amount is less than or equal to 0.
        ValueError: If currency is not supported.
        UserNotFoundError: If user_id does not exist.

    Returns:
        dict with keys: order_id (str), status (str), created_at (float).
    """

NumPy style:

def create_order(user_id: int, amount: float, currency: str) -> dict:
    """
    Create a new order for a user.

    Parameters
    ----------
    user_id : int
        ID of the user placing the order.
    amount : float
        Order total. Must be positive.
    currency : str
        ISO 4217 currency code.

    Raises
    ------
    ValueError
        If amount is less than or equal to 0, or currency is not supported.
    UserNotFoundError
        If user_id does not exist.

    Returns
    -------
    dict
        Keys: order_id, status, created_at.
    """

Both produce the same requirements: two MUST_RAISE entries, one MUST_RETURN.
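To show why a consistent layout makes docstrings machine-readable, here is a minimal sketch that pulls (exception, condition) pairs out of a Google-style Raises: section. It is not Quell's parser. The helper name extract_raises is made up for illustration, and how Quell groups or counts the entries it finds is not shown here.

```python
import inspect
import re


def extract_raises(func):
    """Return (exception_name, condition) pairs from a Google-style Raises: section."""
    # inspect.getdoc() strips the common indentation, so section headers
    # start at column 0 and their entries stay indented.
    doc = inspect.getdoc(func) or ""
    entries = []
    in_raises = False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == "Raises:":
            in_raises = True
            continue
        if in_raises:
            # A blank line or an unindented line ends the section.
            if not stripped or not line.startswith(" "):
                break
            match = re.match(r"(\w+):\s*(.+)", stripped)
            if match:
                entries.append((match.group(1), match.group(2)))
    return entries


def create_order(user_id, amount, currency):
    """
    Create a new order for a user.

    Raises:
        ValueError: If amount is less than or equal to 0.
        ValueError: If currency is not supported.
        UserNotFoundError: If user_id does not exist.

    Returns:
        dict with keys: order_id, status, created_at.
    """


for exception_name, condition in extract_raises(create_order):
    print(f"{exception_name}: {condition}")
```

A real extractor would also need to handle wrapped multi-line conditions and the NumPy layout. This sketch only handles the one-line-per-entry case.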

What makes a `Raises:` entry useful

The more specific the condition, the better the generated test. Quell needs to know what input triggers the exception to construct a valid test.

Specific — best:

# Raises:
#     ValueError: If amount is less than or equal to 0.
#     ValueError: If currency is not in ["USD", "EUR", "GBP"].

Quell knows exactly what to pass: amount=0 for the first entry, and a currency string outside the list for the second. It can construct a test that fails when the guard is removed.
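As a rough picture of what tests for those two entries can look like, here is a hand-written sketch. The create_order body is a hypothetical stand-in written only to exercise the guards, and the tests are not Quell's actual output. It uses the standard library's unittest to stay dependency-free; pytest.raises expresses the same check.

```python
import unittest

SUPPORTED_CURRENCIES = ["USD", "EUR", "GBP"]


def create_order(user_id: int, amount: float, currency: str) -> dict:
    # Hypothetical implementation: just the two documented guards.
    if amount <= 0:
        raise ValueError("amount must be positive")
    if currency not in SUPPORTED_CURRENCIES:
        raise ValueError(f"unsupported currency: {currency}")
    return {"order_id": "ord-1", "status": "created", "created_at": 0.0}


class CreateOrderRaises(unittest.TestCase):
    def test_raises_when_amount_is_zero(self):
        # Boundary value taken straight from "less than or equal to 0".
        with self.assertRaises(ValueError):
            create_order(user_id=1, amount=0, currency="USD")

    def test_raises_when_currency_not_listed(self):
        # Any code outside the documented list should be rejected.
        with self.assertRaises(ValueError):
            create_order(user_id=1, amount=10.0, currency="XXX")
```

Each test keeps every other argument valid, so deleting one guard makes exactly one test fail.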

Vague condition — still useful:

# Raises:
#     ValueError: If input is invalid.

Quell generates a test with generic invalid inputs. The test won't be as targeted, but it will still catch cases where the guard is removed entirely.

External dependency — skipped:

# Raises:
#     RuntimeError: If the external service is unavailable.

There's no way to deterministically trigger this without mocking. Quell flags it as uncovered and skips generation. The LLM fallback can handle it if you configure one, but the rule engine won't hallucinate a test for it.
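If you cover such an entry by hand, one option is to mock the dependency so the failure becomes deterministic. The sketch below uses unittest.mock from the standard library. PaymentGateway and pay are hypothetical names invented for this example, and mapping ConnectionError to RuntimeError is an assumption about how such a function might be written.

```python
from unittest import mock


class PaymentGateway:
    """Hypothetical client for an external service."""

    def charge(self, amount: float) -> str:
        raise NotImplementedError("talks to the network in real code")


def pay(gateway: PaymentGateway, amount: float) -> str:
    """
    Charge the given amount.

    Raises:
        RuntimeError: If the external service is unavailable.
    """
    try:
        return gateway.charge(amount)
    except ConnectionError as exc:
        raise RuntimeError("payment service unavailable") from exc


def test_pay_raises_when_service_is_down():
    # The autospec'd mock keeps the real method signature, and
    # side_effect makes the outage happen on every call.
    gateway = mock.create_autospec(PaymentGateway, instance=True)
    gateway.charge.side_effect = ConnectionError("connection refused")
    try:
        pay(gateway, 10.0)
    except RuntimeError:
        return
    raise AssertionError("expected RuntimeError")
```

Passing the gateway in as an argument is what makes the mock easy to inject. A function that builds its own client internally would need mock.patch instead.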

What makes a `Returns:` entry testable

Quell checks return type and key presence:

# Returns:
#     dict with keys: order_id, status, created_at.

Generated test:

def test_create_order_must_return():
    result = create_order(user_id=1, amount=10.0, currency="USD")
    assert isinstance(result, dict)
    assert "order_id" in result
    assert "status" in result
    assert "created_at" in result

This handles:

dict with keys: X, Y, Z — checks key presence
list of dicts — checks list + element type
Primitive types like str, int, bool

It doesn't auto-generate tests for complex nested structures or union return types. Those get flagged but skipped.
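For the "list of dicts" case, a check of the list and its element type could look like the sketch below. list_orders is a hypothetical function with a stand-in body, and the test is an illustration of the idea, not output copied from Quell.

```python
def list_orders(user_id: int) -> list:
    """
    List the orders placed by a user.

    Returns:
        list of dicts.
    """
    # Hypothetical stand-in data so the example runs on its own.
    return [{"order_id": "ord-1", "status": "created"}]


def test_list_orders_must_return():
    result = list_orders(user_id=1)
    assert isinstance(result, list)
    # Element type check: every item must be a dict.
    assert all(isinstance(item, dict) for item in result)
```

Note that the element check passes trivially on an empty list, so a test like this is most useful when the call is known to return at least one item.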

Pydantic models as specs you don't have to write

If you use Pydantic, the model definition is already a requirement specification:

from typing import Literal
from pydantic import BaseModel, Field

class OrderRequest(BaseModel):
    amount: float = Field(gt=0, le=50_000)
    currency: Literal["USD", "EUR", "GBP", "INR"]
    user_id: int = Field(gt=0)

Quell reads the Field validators and Literal annotations directly — no docstring needed. This generates boundary tests for amount and user_id, and an enum test for currency. Four tests, zero extra documentation written.
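To make "boundary test" concrete, here is a small sketch of how test inputs can be derived from gt and le constraints. It is plain Python and does not import Pydantic. boundary_cases and its step parameter are hypothetical names for illustration, and the sketch assumes the step is smaller than the distance between the two bounds.

```python
def boundary_cases(gt=None, le=None, step=1):
    """Return (value, should_be_accepted) pairs around each declared bound."""
    cases = []
    if gt is not None:
        cases.append((gt, False))         # exactly on the exclusive lower bound
        cases.append((gt + step, True))   # just inside it
    if le is not None:
        cases.append((le, True))          # exactly on the inclusive upper bound
        cases.append((le + step, False))  # just past it
    return cases


# Constraints copied from Field(gt=0, le=50_000) on amount.
print(boundary_cases(gt=0, le=50_000))
# [(0, False), (1, True), (50000, True), (50001, False)]
```

Each rejected value maps to a test that expects a validation error, and each accepted value to a test that expects the model to build successfully.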

Combining both gives you the best coverage:

class CreateOrderInput(BaseModel):
    user_id: int = Field(gt=0)
    amount: float = Field(gt=0)
    currency: Literal["USD", "EUR", "GBP"]

def create_order(data: CreateOrderInput) -> dict:
    """
    Raises:
        UserNotFoundError: If data.user_id does not exist in the database.
        InsufficientFundsError: If user balance is below data.amount.

    Returns:
        dict with keys: order_id, status, created_at, total.
    """

Pydantic handles input validation (auto-generated). The docstring handles business logic requirements that depend on external state (flagged for LLM or manual).

Anti-patterns

A few things that reduce what Quell can extract:

Missing the condition: Raises: ValueError with no explanation. Quell knows an exception should be raised but not when, so it falls back to generic invalid inputs. A specific condition produces a better test.

"If something goes wrong": Too vague to generate a test input from. Worth documenting for human readers, but won't produce an auto-generated test.

Undocumented raises: If your function raises but the docstring doesn't say so, Quell won't know. Make raises explicit — it's good documentation regardless.
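Undocumented raises are the easiest of these to check for yourself. The sketch below uses the standard library's ast module to list exceptions a function raises but never names in its docstring. undocumented_raises is a hypothetical helper, not part of Quell, and it is deliberately crude: it matches names by substring, ignores bare re-raises and dotted names such as errors.NotFound, and attributes raises in nested functions to the outer function as well.

```python
import ast
import textwrap


def undocumented_raises(source: str) -> dict:
    """Map function names to exceptions raised but not mentioned in the docstring."""
    tree = ast.parse(textwrap.dedent(source))
    missing = {}
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node) or ""
        raised = set()
        for child in ast.walk(node):
            if isinstance(child, ast.Raise) and child.exc is not None:
                # "raise ValueError(...)" is a Call; "raise ValueError" is a Name.
                exc = child.exc
                target = exc.func if isinstance(exc, ast.Call) else exc
                if isinstance(target, ast.Name):
                    raised.add(target.id)
        names = sorted(name for name in raised if name not in doc)
        if names:
            missing[node.name] = names
    return missing
```

Called on a module's source text, it returns something like {"create_order": ["KeyError"]} for a function whose docstring lists ValueError but whose body also raises KeyError.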

Trying it

After writing or updating docstrings:

pip install quelltest
quell check src/ --no-llm

That shows you which requirements are covered and which aren't. To generate tests for the uncovered ones, add --fix:

quell check src/ --fix --no-llm

Quell generates and verifies tests for each uncovered requirement the rule engine can handle. The .quell/report.json diagnostic shows exactly what was generated, what was skipped, and why — without including your source code.

See the Quickstart guide to run it on your first file in under two minutes.