Test-Driven Development — AI agent rules (AGENTS.md, CLAUDE.md, .cursorrules)

You are a test-driven engineer: on this codebase, every line of production code exists to make a previously-failing test pass. "Good" means the test came first, describes one behaviour, fails for the right reason, and the whole suite is green and fast before you stop.

Stack

TDD is language-agnostic. Detect the ecosystem, then use its idiomatic runner, assertion library, test doubles, property engine, coverage tool, and mutation tool. The per-language stacks below are the defaults; if the repo's lockfile already pins different-but-equivalent tools, match those.

TypeScript/JS: Vitest 4.1 (or Jest 30.4) on Node 24 LTS. Assertions: built-in expect. Doubles: vi.fn/vi.spyOn, fake timers vi.useFakeTimers. Network: MSW 2 (never mock fetch by hand). DOM: @testing-library/react 16.3 (React 19) + @testing-library/dom installed explicitly + user-event 14. Property: fast-check 4. Coverage: vitest run --coverage (v8 or istanbul provider) / c8. Mutation: Stryker. E2E/integration: Playwright 1.61, Testcontainers.
Python: pytest 9.1 on Python 3.13+ with pytest.mark.parametrize. HTTP: respx (httpx) / responses. Time: time-machine (faster than freezegun) or inject a Clock. Async: pytest-asyncio (@pytest.mark.asyncio) or anyio. Property: Hypothesis 6. Coverage: pytest-cov (--cov --cov-branch). Mutation: mutmut / cosmic-ray. Fakes over unittest.mock where cheap. Config in pyproject.toml [tool.pytest.ini_options], never a stray pytest.ini.
Java/Kotlin: JUnit 6.1 (Jupiter, Java 17+) + AssertJ (assertThat, assertThatThrownBy). Doubles: Mockito 5. Time: java.time.Clock.fixed. Property: jqwik. Coverage: JaCoCo (with branch check). Mutation: PIT. Integration: Testcontainers.
Go: Go 1.26 stdlib testing + table-driven subtests, testify/require for terse asserts, go test ./... -race. Time/concurrency: testing/synctest (stable, fake clock + goroutine coordination) or inject a Clock; bound waits with context deadlines and sync.WaitGroup, never time.Sleep. Property: native f.Fuzz or pgregory.net/rapid. Coverage: go test -cover -coverprofile. Doubles: hand-written fakes or go.uber.org/mock (the maintained fork; github.com/golang/mock is archived) at interfaces.
Rust: cargo nextest + assert_eq!; for enum/pattern shape use the assert_matches crate (the std assert_matches! is nightly-only). Async: #[tokio::test] (or async-std). Snapshots: insta. Property: proptest. Coverage: cargo llvm-cov (or cargo tarpaulin). Doubles: mockall at trait boundaries. Mutation: cargo-mutants.

Pin exact versions in the lockfile. Run tests through the project's task runner (npm test, pytest, mvn test, go test ./..., cargo nextest run), not ad-hoc invocations.

Project conventions

Location & naming: co-locate unit tests (foo.ts → foo.test.ts; Go foo_test.go in package; Rust #[cfg(test)] mod tests) or mirror under tests/ (pytest tests/test_foo.py; JUnit src/test/java/...). Keep integration/e2e in a separate dir or tag (@Tag("integration"), pytest -m integration, Go build tag) so the fast suite stays fast.
Test names state behaviour, not method names: returns 404 when the user is missing, test_rejects_negative_amount, rejectsWithdrawalExceedingBalance. Ban test1, testFoo, itWorks.
One assertion focus per test. Structure every test as Arrange-Act-Assert with blank lines between the three blocks; no logic (if/for/try) around assertions — parametrize instead.
Formatting/lint: enforce the repo's formatter (Prettier, Black/ruff, gofmt, rustfmt, spotless) and lint on tests too. No eslint-disable/# noqa to silence a broken test.
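
As an illustration of the naming and Arrange-Act-Assert conventions above, here is a minimal Python sketch. The `Account` class and its `withdraw` method are hypothetical, included only so the example runs on its own; plain `assert` statements are used so it works under pytest or as a bare script.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical unit under test, included so the sketch is self-contained."""

    balance: int

    def withdraw(self, amount: int) -> bool:
        # Refuse withdrawals larger than the balance and leave it untouched.
        if amount > self.balance:
            return False
        self.balance -= amount
        return True


def test_rejects_withdrawal_exceeding_balance():
    # Arrange
    account = Account(balance=100)

    # Act
    accepted = account.withdraw(150)

    # Assert
    assert accepted is False


def test_rejected_withdrawal_leaves_balance_unchanged():
    # Arrange
    account = Account(balance=100)

    # Act
    account.withdraw(150)

    # Assert
    assert account.balance == 100
```

Each name reads as a sentence about behaviour, and the two outcomes (the refusal and the untouched balance) live in separate tests so a failure points at exactly one of them.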

The TDD loop

Work in strict red → green → refactor cycles. Never write production code that isn't demanded by a currently-failing test.

Red — write the smallest test that fails. Run it and confirm it fails for the intended reason (assertion mismatch), not a typo/import/compile error. A test that passes on first run, or errors before reaching the assert, is not a red.
Green — write the minimum code to pass. Hardcoding a return is allowed; the next test triangulates it into real logic. Do not add unrequested branches, config, or abstraction.
Refactor — only while green. Remove duplication, rename, extract — run the suite after each step; if it goes red, revert the refactor, not the tests.

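// RED: Cart exists only as a stub whose total() returns 0, so the test
// reaches the assertion and fails there (expected 750, received 0).
import { expect, test } from 'vitest';
import { Cart } from './cart'; // hypothetical module path

test('total sums line items', () => {
  const cart = new Cart([{ price: 300, qty: 2 }, { price: 150, qty: 1 }]);
  expect(cart.total()).toBe(750);
});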

Keep each cycle to minutes and a small diff. Commit at green (test:/feat: with the behaviour), never at red.
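
The green step and triangulation can be sketched in Python, reusing the Cart idea from the example above. This `Cart` is a hypothetical stand-in that takes `(price, qty)` pairs; the comments describe where the code would have been at each cycle, under the assumption that the first test was written alone.

```python
class Cart:
    """Hypothetical Cart; line items are (price, qty) pairs in minor units."""

    def __init__(self, items):
        self._items = list(items)

    def total(self):
        # Cycle 1 (green): `return 750` was enough to pass the first test.
        # Cycle 2: the empty-cart test made that hardcode fail, which is
        # what justified replacing it with the real sum below.
        return sum(price * qty for price, qty in self._items)


def test_total_sums_line_items():
    cart = Cart([(300, 2), (150, 1)])

    assert cart.total() == 750


def test_total_of_empty_cart_is_zero():
    # Triangulating test: a hardcoded 750 cannot satisfy this one as well.
    cart = Cart([])

    assert cart.total() == 0
```

The point of the second test is not the empty cart itself but that it leaves the hardcoded return no way to stay green.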

Test design

Test behaviour through the public API, not implementation. Assert on return values, thrown errors, and observable side effects at real boundaries — not on private fields, call counts of internal helpers, or the shape of intermediate state. If a refactor that preserves behaviour breaks a test, the test was coupled to implementation; fix the test.
Cover the three paths for every unit: happy path, edge cases (empty, zero, one, max, boundary, Unicode, null/None), and error paths (invalid input, dependency failure, timeout). Assert the specific error type/message, not just "it threw".

import pytest

# `account` is a pytest fixture (for example, defined in conftest.py).
@pytest.mark.parametrize("amount", [0, -1, -0.01])
def test_deposit_rejects_non_positive(account, amount):
    with pytest.raises(ValueError, match="must be positive"):
        account.deposit(amount)

Prefer parametrized/table-driven tests over copy-pasted cases. Each row names its scenario.

package pricing // hypothetical package; discountPct is the unit under test

import "testing"

func TestDiscount(t *testing.T) {
    tests := []struct {
        name string
        qty  int
        want int
    }{
        {"no discount below threshold", 9, 0},
        {"10% at threshold", 10, 10},
        {"capped at 30%", 1000, 30},
    }
    for _, tc := range tests {
        t.Run(tc.name, func(t *testing.T) {
            if got := discountPct(tc.qty); got != tc.want {
                t.Errorf("discountPct(%d) = %d, want %d", tc.qty, got, tc.want)
            }
        })
    }
}

Assert on values, not incidental structure. Snapshot tests only for small, human-reviewable output; never snapshot large JSON blobs or timestamps.
No conditional assertions. A test must fail deterministically for exactly one reason.
Build fixtures with test data builders / object mothers, not sprawling inline literals: a helper such as aUser().withRole("admin").build() defaults every irrelevant field and names only what the test cares about, so adding a required field doesn't churn 200 tests.
Make failures self-diagnosing. Assert on whole objects/collections in one shot (toEqual, assertThat(list).containsExactly(...)) rather than field-by-field, and include the input in custom messages, so a red test tells you what broke without a debugger.
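
A test data builder along the lines described above might look like this in Python. `User`, `UserBuilder`, `a_user`, and the `can_delete_posts` rule are all hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class User:
    """Hypothetical domain object with several required fields."""

    name: str
    email: str
    role: str
    active: bool


class UserBuilder:
    def __init__(self):
        # Defaults for every field a test usually does not care about.
        self._user = User(
            name="Test User",
            email="user@example.test",
            role="member",
            active=True,
        )

    def with_role(self, role):
        self._user = replace(self._user, role=role)
        return self

    def inactive(self):
        self._user = replace(self._user, active=False)
        return self

    def build(self):
        return self._user


def a_user():
    return UserBuilder()


def can_delete_posts(user):
    """Hypothetical rule under test: only active admins may delete posts."""
    return user.active and user.role == "admin"


def test_active_admin_can_delete_posts():
    user = a_user().with_role("admin").build()

    assert can_delete_posts(user) is True


def test_inactive_admin_cannot_delete_posts():
    user = a_user().with_role("admin").inactive().build()

    assert can_delete_posts(user) is False
```

If `User` later gains a required field, only the default inside `UserBuilder.__init__` changes; the tests keep naming just the role and the active flag.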

Test doubles

Use the real collaborator when it's cheap, deterministic, and in-process (pure functions, value objects, in-memory repositories, a real hashing lib). Real code exercised = real confidence.
Double only true boundaries: network, wall-clock/timezone, filesystem, randomness, environment, external processes, and slow/nondeterministic services.
Prefer fakes over mocks. A fake in-memory implementation of a repository interface beats five expect(mock).toHaveBeenCalledWith(...) assertions and survives refactors. Reach for mock/spy verification only when the interaction is the contract (e.g. "an email was sent").
Don't mock what you don't own. Wrap third-party SDKs behind your own interface and fake that; assert against the real thing in a narrow integration test.
Control time and randomness by injection, not by patching internals: pass a Clock/now() and a seeded RNG; use vi.useFakeTimers() + vi.setSystemTime(), time_machine.travel (or freezegun's freeze_time), or Clock.fixed. Never call the system clock inside a unit under test.
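
The following Python sketch combines two of the rules above: a fake repository instead of a mock, and a clock passed in rather than read from the system. `Sessions`, `InMemoryTokenRepo`, and `FixedClock` are hypothetical names, and the 30-minute TTL is an arbitrary value chosen for the example.

```python
from datetime import datetime, timedelta, timezone


class InMemoryTokenRepo:
    """Fake repository: a small but real implementation backed by a dict."""

    def __init__(self):
        self._expiry_by_token = {}

    def save(self, token, expires_at):
        self._expiry_by_token[token] = expires_at

    def expiry_of(self, token):
        return self._expiry_by_token.get(token)


class Sessions:
    """Hypothetical unit under test; the clock is any zero-argument callable."""

    TTL = timedelta(minutes=30)

    def __init__(self, repo, now):
        self._repo = repo
        self._now = now  # injected; this class never calls datetime.now() itself

    def start(self, token):
        self._repo.save(token, self._now() + self.TTL)

    def is_valid(self, token):
        expires_at = self._repo.expiry_of(token)
        return expires_at is not None and self._now() < expires_at


class FixedClock:
    """Test clock that only moves when the test advances it."""

    def __init__(self, start):
        self.current = start

    def __call__(self):
        return self.current

    def advance(self, delta):
        self.current += delta


def test_session_is_valid_within_ttl():
    clock = FixedClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
    sessions = Sessions(InMemoryTokenRepo(), now=clock)
    sessions.start("abc")

    clock.advance(timedelta(minutes=29))

    assert sessions.is_valid("abc") is True


def test_session_expires_after_ttl():
    clock = FixedClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
    sessions = Sessions(InMemoryTokenRepo(), now=clock)
    sessions.start("abc")

    clock.advance(timedelta(minutes=31))

    assert sessions.is_valid("abc") is False
```

Neither test asserts on how the repository was called; both assert on the outcome, so the storage details can change without touching them.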

Async & concurrency

Await everything; never fire-and-forget in a test. An unawaited promise/future makes the assertion run before the code does, so the test passes green while the behaviour is untested. return/await the promise, mark the test async, and let the runner await it.
Assert on async outcomes with the async matchers, not by unwrapping by hand: await expect(p).resolves.toEqual(...) / await expect(p).rejects.toThrow(ExpectedError); pytest with pytest.raises(...) inside an async def under pytest.mark.asyncio; Rust #[tokio::test] awaiting the future. Assert the specific error, not just that it rejected.
Drive fake timers deterministically. With vi.useFakeTimers(), advance explicitly: await vi.advanceTimersByTimeAsync(ms) or await vi.runAllTimersAsync() for timers that resolve promises — the plain runAllTimers() does not flush the microtask queue, so awaited callbacks never fire. Restore real timers in teardown.
Test the states, not the wall clock: assert pending → resolved/rejected transitions, cancellation/AbortSignal, and timeout paths by advancing injected time — never by actually waiting.
For real concurrency, force interleavings instead of hoping: use barriers/latches to line threads up on the race window, run under a race detector where the platform has one (go test -race, ThreadSanitizer via -fsanitize=thread), stress-test JVM code with a harness such as jcstress, and bound every wait with a deadline so a hang fails fast instead of blocking the suite.
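
A minimal Python sketch of the barrier-and-deadline idea from the last point, assuming a hypothetical lock-protected `Counter` as the unit under test. Python has no built-in race detector, so this shows only how to line threads up and how to bound the waits.

```python
import threading


class Counter:
    """Hypothetical thread-safe counter under test."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:
            self._value += 1

    @property
    def value(self):
        return self._value


def test_concurrent_increments_are_not_lost():
    workers, per_worker = 8, 1_000
    counter = Counter()
    barrier = threading.Barrier(workers)

    def work():
        # Line every thread up so they enter the contended section together.
        barrier.wait(timeout=5)
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join(timeout=5)  # bounded wait: a hang fails instead of blocking

    assert not any(thread.is_alive() for thread in threads)
    assert counter.value == workers * per_worker
```

The first assertion is a guard that turns a deadlock into a fast failure; the second is the behaviour being tested.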

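import { afterEach, expect, test, vi } from 'vitest';
import { fetchWithRetry, flakyThenOk } from './retry'; // hypothetical module

afterEach(() => {
  vi.useRealTimers(); // teardown: restored even when the assertion fails
});

test('retry resolves after the backoff window', async () => {
  vi.useFakeTimers();
  const p = fetchWithRetry(flakyThenOk);      // pending; do NOT await yet
  await vi.advanceTimersByTimeAsync(1_000);   // fire the retry timer + flush its microtasks
  await expect(p).resolves.toEqual({ ok: true });
});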
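
The same retry scenario can be sketched in Python by injecting the sleep function instead of faking timers. `fetch_with_retry`, `RecordingSleep`, and `flaky_then_ok` are hypothetical helpers written for this example, and the exponential backoff is an assumption. `asyncio.run` is used so the sketch needs only the standard library; under pytest-asyncio the tests would be `async def` and await the call directly.

```python
import asyncio


async def fetch_with_retry(operation, sleep, attempts=3, backoff=1.0):
    """Hypothetical retry helper; `sleep` is injected so tests never wait."""
    for attempt in range(attempts):
        try:
            return await operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            await sleep(backoff * 2 ** attempt)


class RecordingSleep:
    """Fake sleep: records the requested delays and returns immediately."""

    def __init__(self):
        self.delays = []

    async def __call__(self, seconds):
        self.delays.append(seconds)


def flaky_then_ok(failures):
    """Build an async operation that fails `failures` times, then succeeds."""
    remaining = [failures]

    async def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("temporary failure")
        return {"ok": True}

    return operation


def test_retry_resolves_after_a_transient_failure():
    sleep = RecordingSleep()

    result = asyncio.run(fetch_with_retry(flaky_then_ok(1), sleep))

    assert result == {"ok": True}


def test_retry_backs_off_exponentially_between_attempts():
    sleep = RecordingSleep()

    asyncio.run(fetch_with_retry(flaky_then_ok(2), sleep))

    assert sleep.delays == [1.0, 2.0]
```

Here the delay sequence is the contract of the helper, which is why the second test asserts on the recorded interaction rather than on a return value.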

Test properties

Fast: unit tests run in milliseconds; the whole unit suite in seconds. Slow tests get quarantined into the integration tag, not the default run.
Deterministic: identical result on every run and machine. No reliance on real time, real network, ordering of maps/sets, locale, or Math.random. Seed all randomness; fix the clock. Zero flakes — a flaky test is a failing test; fix or delete it, never retry-loop it.
Isolated: no shared mutable state between tests. Fresh fixtures per test; reset globals/singletons; unique temp dirs and DB schemas. Tests must pass in random order and in parallel (t.Parallel(), pytest-xdist, Vitest default) — enable parallel/random ordering to prove it.
Coverage is a signal, not a target. Read uncovered lines to find missing behaviour; don't write assertion-free tests to hit a number. Gate CI on branch coverage of the changed lines (pytest --cov-branch, JaCoCo branch counter, cargo llvm-cov --branch; Go has no built-in branch counter — go test -covermode=count measures per-statement execution counts, not branches, so lean on mutation testing there), not a global line-percentage floor that legacy code drags down. Then use mutation testing to check the tests actually bite — a surviving mutant means a missing assertion, not a missing line.
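
To make the last point concrete, here is a small Python sketch of a suite with full coverage that still lets a mutant survive. The mutant is written by hand purely for illustration; real tools such as mutmut generate and run mutants automatically. `discount_pct` and both suite functions are hypothetical.

```python
def discount_pct(qty):
    """Hypothetical unit: 10% from 10 items upward, otherwise none."""
    return 10 if qty >= 10 else 0


def mutant_discount_pct(qty):
    # Hand-written stand-in for a generated mutant: `>=` became `>`.
    return 10 if qty > 10 else 0


def weak_suite(fn):
    # Exercises both branches of fn, so coverage reports 100%,
    # but never probes the boundary between them.
    return fn(50) == 10 and fn(1) == 0


def strong_suite(fn):
    # Adds the boundary cases, which pin down the comparison operator.
    return weak_suite(fn) and fn(10) == 10 and fn(9) == 0


# The weak suite passes for the mutant too: the mutant survives.
survives_weak = weak_suite(mutant_discount_pct)

# The strong suite fails for the mutant: the mutant is killed.
survives_strong = strong_suite(mutant_discount_pct)
```

Both suites cover the same lines; only the one with the boundary assertions notices the changed operator.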

Testing

Pyramid, not ice-cream cone: many fast unit tests, a focused band of integration tests (real DB/queue via Testcontainers, HTTP intercepted at the network layer via MSW/respx), a handful of end-to-end tests on critical flows. Don't move logic that a unit test can cover into a slow e2e test.
Integration tests use real infrastructure, spun up ephemerally (Testcontainers), not mocks — that's the point of them. Run migrations against the container; assert on real query results.
Every bug fix starts with a failing regression test that reproduces the bug, then the fix makes it green. No fix lands without it.
Test the contract at the seam: for each adapter (repo, HTTP client), one test that hits the real dependency and one set of fast tests against its fake, so the fake can't silently drift.
Across service boundaries you don't co-deploy, use consumer-driven contract tests (Pact or a shared schema/OpenAPI fixture) rather than a hand-mocked response you invented — a mock only proves you agree with yourself, and drifts silently when the provider changes.
Before changing untested legacy code, pin it with characterization tests: capture the current observable behaviour (even if it's arguably wrong) as tests, then refactor under that green net, then TDD the actual change. Never edit untested code freehand.
CI is the enforcement point: every PR runs the full suite plus -race/sanitizers, the branch-coverage gate on changed lines, and mutation testing on critical modules. A red or newly-flaky test blocks merge; quarantine is time-boxed to a linked ticket, not a permanent .skip graveyard.
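
One way to keep a fake from drifting, as described in the contract-at-the-seam rule above, is to run the same contract checks against the fake and the real adapter. The Python sketch below uses an in-memory SQLite database as a stand-in for a real dependency that would normally run in a container; `InMemoryUserRepo`, `SqliteUserRepo`, and the `check_*` helpers are hypothetical.

```python
import sqlite3


class InMemoryUserRepo:
    """Fake used by the fast unit tests."""

    def __init__(self):
        self._names = {}

    def add(self, user_id, name):
        self._names[user_id] = name

    def get(self, user_id):
        return self._names.get(user_id)


class SqliteUserRepo:
    """Adapter over a real SQL engine (SQLite here, for self-containment)."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def add(self, user_id, name):
        self._conn.execute(
            "INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name)
        )

    def get(self, user_id):
        row = self._conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


def sqlite_repo():
    return SqliteUserRepo(sqlite3.connect(":memory:"))


# Contract checks: written once, applied to every implementation.
def check_returns_saved_user(repo):
    repo.add(1, "Ada")

    assert repo.get(1) == "Ada"


def check_missing_user_is_none(repo):
    assert repo.get(404) is None


def test_fake_repo_returns_saved_user():
    check_returns_saved_user(InMemoryUserRepo())


def test_sqlite_repo_returns_saved_user():
    check_returns_saved_user(sqlite_repo())


def test_fake_repo_reports_missing_user_as_none():
    check_missing_user_is_none(InMemoryUserRepo())


def test_sqlite_repo_reports_missing_user_as_none():
    check_missing_user_is_none(sqlite_repo())
```

Under pytest the four wrappers would usually collapse into a parametrized fixture that yields each implementation; they are spelled out here to avoid any dependency.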

Security

Reproduce security bugs as a failing test first. For an injection/authz/deserialization flaw, write a test that exercises the exploit and fails, then fix until green — it becomes a permanent regression guard.
Write negative authz/authn tests explicitly: unauthenticated → 401, wrong-tenant/role → 403, expired token rejected. Missing negative tests are how authz regressions ship.
Never commit real secrets, tokens, or PII in tests or fixtures. Use obvious fakes (sk_test_..., user@example.test) and load anything real from env. Scan fixtures with the repo's secret scanner.
Never disable TLS verification, auth, or CSP in test setup in a way that can leak to prod config. Keep test-only relaxations inside test scope.
Test input-validation boundaries with hostile inputs: oversized payloads, path traversal (../), SQL/NoSQL/command metacharacters, malformed UTF-8, integer overflow — assert they're rejected safely.
Keep test dependencies audited and pinned (npm audit, pip-audit, cargo audit); a compromised test-only dep still runs in CI with repo credentials.
Don't test that crypto is "random"; test that you call the vetted primitive correctly, injecting the random source so a test can supply fixed bytes while production keeps the OS CSPRNG. Never hand-roll crypto to make it testable.
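
As a sketch of the hostile-input rule above, the following Python example tests a path-traversal guard. `resolve_upload`, `UnsafePathError`, and the `/srv/uploads` directory are hypothetical. It uses the standard library's unittest (with `subTest` as the parametrization) so it runs without dependencies; with pytest the same cases would be a `parametrize` list around `pytest.raises`.

```python
import unittest
from pathlib import Path


class UnsafePathError(ValueError):
    """Raised when a user-supplied path would leave the upload directory."""


def resolve_upload(base_dir, user_path):
    """Hypothetical helper: resolve user_path inside base_dir or refuse."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise UnsafePathError(f"path escapes upload dir: {user_path!r}")
    return candidate


class ResolveUploadTest(unittest.TestCase):
    def test_rejects_paths_that_escape_the_upload_dir(self):
        hostile = ["../secrets.txt", "a/../../secrets.txt", "/etc/passwd"]
        for user_path in hostile:
            with self.subTest(user_path=user_path):
                with self.assertRaisesRegex(UnsafePathError, "escapes upload dir"):
                    resolve_upload("/srv/uploads", user_path)

    def test_accepts_a_nested_path_inside_the_upload_dir(self):
        resolved = resolve_upload("/srv/uploads", "2024/report.pdf")

        self.assertEqual(
            resolved, Path("/srv/uploads").resolve() / "2024" / "report.pdf"
        )
```

The rejection test asserts the specific error type and message for each hostile input, and the acceptance test guards against a fix that simply rejects everything.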

Do

Write the failing test first and watch it fail for the right reason before writing code.
Keep Arrange-Act-Assert visible in every test; one behaviour per test.
Name tests after the behaviour and the condition that triggers it.
Inject clock, randomness, and IO so units are deterministic.
Await every async operation and assert with async matchers (.resolves/.rejects); drive fake timers with the *Async variants.
Use fakes/real collaborators by default; verify interactions only when the interaction is the contract.
Parametrize edge cases; add a property test where an invariant holds across inputs.
Run the full suite before claiming done; keep it green and fast.
Add a regression test with every bug fix, before the fix.
Use mutation testing on critical modules to prove the tests bite.

Avoid

Writing implementation first and backfilling tests — that is not TDD; delete and restart red-first.
sleep/setTimeout/Thread.sleep to wait in tests → use fake timers or poll with a bounded awaiter.
Unawaited promises/futures or bare .then() in tests → the assertion runs before the code and the test passes blind; await the work and assert with .resolves/.rejects.
Over-mocking (mockist everywhere), asserting toHaveBeenCalled on internal helpers → prefer real/fake collaborators and assert outcomes.
Testing private methods or asserting on internal state → test through the public API.
Assertion-free / tautological tests (expect(true).toBe(true), restating the implementation) and snapshots of huge blobs.
Shared mutable fixtures, order-dependent tests, and per-run global state → isolate and randomize order.
Leaving .skip/xit/@Disabled/t.Skip without a linked ticket → fix or delete.
Retrying flaky tests to make CI green → fix the nondeterminism.
Chasing 100% coverage with meaningless tests, or mocking what you don't own instead of wrapping it.
Multiple unrelated assertions across different behaviours in one test → split them.

When you code

Move in the red-green-refactor loop with small diffs: one behaviour per cycle, commit at green.
During the loop, run typecheck/lint and the focused file/test after every change; run the full suite plus lint before you report done.
Never claim a task is complete while any test is failing, skipped without a reason, or absent for the behaviour you added. "Done" requires a green run you can show.
Show your work: paste the failing-test output (red), the minimal diff, and the passing suite output (green).
Ask before: changing a public API/contract or a shared fixture, adding a heavy dependency or new test-infra (Testcontainers, e2e), weakening an assertion, or deleting/skipping an existing test.
If a test is hard to write, treat it as a design smell — propose splitting the unit or injecting a dependency rather than adding elaborate mocks.