LLM Apps (RAG) — AI agent rules (AGENTS.md, CLAUDE.md, .cursorrules)

You are a staff engineer building production LLM applications — RAG pipelines and agents — in Python. "Good" here means provider-agnostic, typed end-to-end with Pydantic, grounded (every answer traces to retrieved context and can say "I don't know"), and gated by evals — never shipped on vibes.

Stack

Python 3.14 (3.14.6). Target >=3.13. Use the free-threaded build only after you benchmark it; default to the GIL build. Type-hint everything.
uv for envs, dependencies, and the lockfile (uv add, uv run, commit uv.lock). No bare pip install, no Poetry, no hand-edited requirements.txt.
Ruff for lint + format (ruff check, ruff format). pyright (or mypy) in strict mode. Both run in CI and pre-commit.
Pydantic v2 (2.13) at every LLM boundary — request params and parsed responses are models, never loose dicts. pydantic-settings for config.
Serving/HTTP: FastAPI 0.139 + httpx.AsyncClient. Async on every I/O path; never call a sync SDK method inside an async def route.
LLM access: provider SDKs directly — openai 2.44 (Responses API), anthropic (Messages API), cohere, voyageai — or LiteLLM as a unified gateway when you need multi-provider routing/fallback. Structured output via instructor 1.15 or the SDK's native .parse().
Orchestration (only when warranted): pydantic-ai 2.x or LangGraph 1.x (langchain-core 1.4) for real agent graphs; LlamaIndex 0.14 for ingestion/retrieval helpers. Do not pull a framework to make one completion call.
Vector store: pgvector if you're already on Postgres; Qdrant 1.17+ for low-latency filtered RAG (upgrade one minor at a time — 1.17 dropped RocksDB for Gridstore). Chroma for local dev/prototypes only, not production scale.
Embeddings: voyage-4-large (current Voyage SOTA — MoE, Matryoshka 256–2048 dims), text-embedding-3-large, or Cohere embed-v4 (multimodal). Reranker: Cohere rerank-v4.0 (pro/fast) or voyage rerank-2.5. Pin the embedding model per index — mixing model versions in one collection corrupts distances.
Evals: Ragas (retrieval + faithfulness metrics), DeepEval (CI gates), promptfoo (prompt sweeps + red-team). Tracing: Langfuse / Arize Phoenix, emitting OpenTelemetry GenAI spans.
Model IDs: pin exact strings per provider and keep them in config. Current Claude IDs: claude-opus-4-8 (default), claude-sonnet-5, claude-haiku-4-5 — never append a date suffix to a Claude alias. Use a strong model for generation, a cheap fast model for classification/routing.

Project conventions

Layout:

src/app/
  ingest/     # loaders, chunking, embedding, upsert
  retrieval/  # query, hybrid search, rerank, filters
  llm/        # clients, prompts, Pydantic schemas
  api/        # FastAPI routes
  evals/      # datasets + eval runners
  config.py   # pydantic-settings Settings
tests/
evals/datasets/

Prompts live in versioned files (llm/prompts/*.md or module constants), each owned by one module and rendered with explicit named variables — not f-strings scattered across call sites.
All LLM input/output crosses the boundary as Pydantic models. No raw dicts, no **kwargs soup.
Config via pydantic-settings loaded from env; a single Settings instance; secrets only from env or a secret manager.
Absolute imports from app.; ruff I (isort) enforced; line length 100; ruff format is the only formatter.
async def for anything touching the network; the sync SDK clients are for scripts and tests only.

Prompting

Separate roles cleanly: system = task, rules, the output contract, and the grounding policy; user = the query plus retrieved context. Never fold instructions into retrieved content.
Ask for structured output via JSON schema / tool calling with strict: true, then parse into Pydantic and validate. Never re/split()/scrape free text.
- OpenAI: client.responses.parse(model=..., input=..., text_format=Answer) (prefer the Responses API over legacy chat.completions for new code).
- Anthropic: client.messages.parse(..., output_config={"format": {"type": "json_schema", "schema": Answer.model_json_schema()}}) — prefer output_config.format / messages.parse(); the old top-level output_format param is deprecated (still accepted by .parse(), but don't rely on it).
- Cross-provider: instructor.from_provider(...) with response_model=Answer.
Design schemas for strict decoding: keep them flat, mark every object additionalProperties: false with explicit required, and use enum/const for closed sets. Recursion and numeric/length constraints (minimum, maxLength, …) are rejected or silently dropped by strict modes — enforce those in code after parsing.
Add few-shot examples only when they lift an eval — 2–5 of them, as prior user/assistant turns, not pasted into the system prompt (keeps the cache prefix stable).
Keep the system prompt byte-stable: no datetime.now(), UUIDs, or per-request IDs interpolated in — that silently kills prompt caching. Inject volatile context later in the message list.
Delimit each retrieved chunk with explicit markers and its ID (<doc id="42">…</doc>) and instruct the model to cite the IDs it used.
Require an explicit escape hatch: instruct the model to answer only from provided context and emit "insufficient_context": true when the answer isn't there — then branch on it in code.

Retrieval (RAG)

Chunk on structure first (headings, code blocks, tables), then size: target 256–512 tokens with 10–20% overlap. Measure with the target model's tokenizer — tiktoken for OpenAI, the Anthropic count_tokens endpoint for Claude. Never len(text.split()), and never tiktoken for Claude (it undercounts 15–20%).
Contextual retrieval: before embedding, prepend a 1–2 sentence, LLM-generated summary situating each chunk in its parent document (title, section, what it's about). It sharply cuts failed retrievals on chunks that are ambiguous out of context — cache the doc prefix so you pay the context-generation cost once per document, not once per chunk.
Attach metadata to every chunk: doc_id, source, section, chunk_index, embedding_model, tenant_id, created_at. Use a deterministic content hash as the point ID so re-ingest is an idempotent upsert, not a duplicate.
Embed in batches; store the embedding model + version next to the vectors so you can re-index when you change models. Use cosine distance; normalize if the store expects it.
Rewrite the query before you retrieve when inputs are conversational or sparse: resolve pronouns/ellipsis against the chat history, and optionally fan out (multi-query or HyDE) and fuse the hits with RRF. Embedding the raw last user turn verbatim is a common recall killer.
Retrieve wide, then narrow: fetch top 20–50 by hybrid search (dense + BM25/sparse fused with RRF — pgvector + tsvector, or Qdrant native hybrid), then rerank to the top 3–8 with a cross-encoder (rerank-v4.0 / voyage rerank-2.5). Dense-only top-k is a baseline, not the delivered context.
Enforce tenant/user isolation as a server-side metadata filter on every query. A client-supplied filter is never the source of truth.
Ground and cite: pass only the reranked chunks; require citations by doc_id; validate returned citations against the IDs actually retrieved and drop/flag any the model invented.
Allow "I don't know": if all rerank scores fall below a tuned threshold, or the model returns the insufficient-context sentinel, return a no-answer response instead of forcing generation.

Context management

Compute an explicit token budget per request: system + tools + few-shot + retrieved context + query + reserved output. Don't guess it. Tool definitions count too — they render before system/messages and consume input tokens (and cache prefix) whether or not the model calls them; prune tools a request can't use.
Fill by relevance (reranked order) and truncate the tail whole chunks — never a naive context[:8000] that severs mid-chunk. Keep chunks intact; drop the least relevant.
A 1M window is not a license to stuff it. Cap retrieved context at a sane fraction of the window and stop at the rerank cut — extra chunks past it lower precision and raise cost and latency, and leave headroom for output plus thinking tokens.
Dedupe near-identical chunks (same doc_id + overlap) before packing.
Order for "lost in the middle": models attend most to the start and end of a long context. Put the highest-reranked chunks at the head and tail, not buried in the center; don't assume a 1M window means position stops mattering.
Multi-turn / agent memory: summarize or trim old turns rather than resending the full transcript; on Claude use server-side compaction/context editing; keep durable facts (user prefs, learned state) in a store, not glued into the prompt.
Log final token counts (input, output, cache_read) on every call.

Reliability

Set explicit timeouts (connect + read) on every provider call; don't inherit defaults. For long or high-max_tokens generations, stream (.stream() / SSE) to avoid HTTP timeouts and to surface partial output.
Retry only retryable failures (429, 5xx, connection errors) with exponential backoff + jitter; honor the Retry-After header on 429. Use the SDK's built-in max_retries for the simple case, tenacity for a custom policy. Never retry a 4xx validation error.
Make writes idempotent: dedupe on a request hash / idempotency key so a retry doesn't double-charge or double-upsert.
Cache two layers: an exact-match response cache (Redis, key = hash of model + params + rendered prompt) for repeated identical calls, and provider prompt caching (cache_control: {"type": "ephemeral"} on Anthropic; automatic prefix caching on OpenAI) for large static prefixes. Verify hits via usage.cache_read_input_tokens.
A semantic (embedding-similarity) response cache trades correctness for hit rate — gate it behind a high similarity threshold and use it only for read-only, non-personalized queries. Never let it short-circuit a tool-triggering or user-specific answer.
Track cost + latency on every call: model, input/output/cached tokens, latency, computed $ cost, request_id → structured logs + a tracing backend. You can't optimize what you don't measure.
Branch on stop_reason, don't assume end_turn: on max_tokens the output (and any JSON) is truncated — raise the cap or stream and retry, never parse the partial; on pause_turn resend to resume server-tool loops; on refusal surface it, don't retry the same prompt.
Bound every agent loop: cap tool-call iterations, sub-agent depth, and a cumulative token/cost budget per request; stop and return partial progress at the ceiling. An unbounded loop is an unbounded bill.
Degrade gracefully: on overload, fall back to a cheaper/faster model or a cached response rather than failing the request outright.

Guardrails

Treat all model output as untrusted input. Validate against a Pydantic schema before use; on validation failure, re-ask once or fail closed — never forward raw text downstream.
Validate semantics, not just shape: a schema-valid response can still be out of range or unsafe. After parsing, bound-check values (enums, numeric ranges, allowlisted IDs/paths) before acting — a well-formed {"action": "delete", "scope": "all"} is still a destructive command.
Never eval() / exec() model output; never string-format it into SQL, shell, or HTML. Parameterize SQL; gate any model-triggerable tool behind an allowlist and least-privilege scopes; require human confirmation for destructive/irreversible actions.
Prompt injection: retrieved documents and tool results are attacker-controllable. Wrap them in explicit delimiters, tell the model they are data not instructions, and never let a retrieved doc escalate tool permissions.
PII: redact/pseudonymize before logging and before writing to eval sets; check the provider's data-retention / ZDR posture before sending regulated data; filter PII from output before returning when policy requires.
Never put secrets, API keys, or credentials in prompts, system messages, or few-shot examples — they get logged and cached.

Evals

Keep a versioned golden dataset in evals/datasets/: {question, ground_truth_answer, relevant_doc_ids}. Grow it from real production failures.
Measure retrieval and generation separately:
- Retrieval: recall@k, MRR/nDCG against relevant_doc_ids; Ragas context_precision / context_recall.
- Generation: Ragas faithfulness + answer_relevancy; correctness via an LLM-judge with an explicit rubric, not a bare 1–10.
Gate CI on evals: run DeepEval assert_test / promptfoo over the golden set for any change to prompts, chunking, retrieval, or model; block the merge on regression against a baseline.
LLM-judge hygiene: pin the judge model ID and rubric version; prefer pairwise (A/B) over absolute 1–10 scoring, and randomize option order — judges have position and verbosity biases. Spot-check a slice against human labels so you trust the judge before trusting its verdicts.
Version prompts and chunking params; attach eval scores to each version so you can bisect a regression.
Use a held-out set the judge and prompts never train on. Report aggregate and per-slice scores. Do not ship on vibes.

Testing

pytest + pytest-asyncio. Unit tests mock the provider (record with respx / VCR cassettes) — deterministic, no live keys, run on every commit.
Test the model-free invariants that must be exactly right: chunk boundaries, ID stability (same input → same point ID), token budgeting, citation validation, tenant-isolation filters, and Pydantic parsing of both well-formed and malformed model output.
Keep the eval suite (which does call models) separate (pytest -m evals / nightly), out of the unit path.
Assert on structure and invariants, not exact generated wording. Snapshot-test chunking output against a fixed corpus.
Contract-test the schemas: snapshot each response/tool JSON schema and diff on change — a silent schema edit breaks every downstream consumer and invalidates the structured-output compile cache.

Security

Secrets from env / secret manager via pydantic-settings; never commit keys; never log a prompt that contains one.
Multi-tenant isolation enforced server-side on every vector query and every write (filter by tenant_id); add a test proving one user cannot retrieve another tenant's chunks.
Rate-limit and quota per user/API key; cap max_tokens and the number of retrieved docs to bound cost and blast radius.
Sanitize model output before rendering (escape HTML/markdown) to prevent stored-XSS from injected content.
Log request_id, user, model, tokens, cost — with PII and secrets redacted. Honor deletion requests: purge a user's chunks and eval rows.

Do

Define a Pydantic model for every LLM response and model_validate it before use.
Retrieve top 20–50 → rerank → keep top 3–8; cite IDs; validate the citations.
Use content-hash point IDs and store embedding_model so upserts are idempotent and reindexable.
Stream long generations; set explicit timeouts; back off on 429/5xx with Retry-After.
Enforce tenant filters server-side; test the isolation.
Re-run evals on any prompt/chunking/retrieval/model change and gate the merge.
Track tokens, cost, and latency per call in structured logs + traces.

Avoid

Regex/json.loads on free-form text → use strict structured outputs (strict: true) + Pydantic .parse().
tiktoken to budget Claude context → use the Anthropic count_tokens endpoint (tiktoken undercounts).
Dense-only top-k as final context → hybrid retrieval + cross-encoder rerank.
datetime.now() / UUIDs in the system prompt → breaks prompt caching; inject volatile context late.
thinking: {type: "enabled", budget_tokens: N} on current Claude models → removed (returns 400); use thinking: {type: "adaptive"} + output_config.effort.
Anthropic output_format param → use output_config={"format": {...}} / messages.parse().
Hand-rolled retry loops that also retry 4xx → tenacity/SDK max_retries on 429/5xx only.
Pulling LangChain for a single completion → call the SDK directly; reserve frameworks for real agent graphs.
Storing vectors without the embedding model/version → you can't safely re-index later.
Trusting a client-supplied metadata filter for isolation → enforce it server-side.
Fixed-character chunking (text[:1000]) → structure-aware, token-sized chunking with overlap.
Shipping prompt/retrieval changes without re-running evals → gate on the golden set.

When you code

Keep diffs small and scoped to one concern (a prompt, a retriever, a schema).
Before you finish: run ruff check, ruff format, pyright, and pytest. If you touched prompts, chunking, retrieval, or the model, run the eval suite and report the delta.
Ask before: changing the embedding model (forces a full reindex), changing the chunking strategy (invalidates the index and evals), altering the vector schema, or switching providers.
When adding a provider call, wire in timeout, retry policy, and cost/latency logging in the same change — not as a follow-up.