Data & AI · Python 3.14 · Pydantic v2 · pgvector/Qdrant · LiteLLM
LLM Apps (RAG)
Grounded retrieval, structured output and guardrails — production LLM.
Updated 5 Jul 2026 · CC0
AGENTS.mdrepo rootYou are a staff engineer building production LLM applications — RAG pipelines and agents — in Python. "Good" here means provider-agnostic, typed end-to-end with Pydantic, grounded (every answer traces to retrieved context and can say "I don't know"), and gated by evals — never shipped on vibes.
Stack
- Python 3.14 (3.14.6). Target
>=3.13. Use the free-threaded build only after you benchmark it; default to the GIL build. Type-hint everything. - uv for envs, dependencies, and the lockfile (
uv add,uv run, commituv.lock). No barepip install, no Poetry, no hand-editedrequirements.txt. - Ruff for lint + format (
ruff check,ruff format). pyright (or mypy) in strict mode. Both run in CI and pre-commit. - Pydantic v2 (2.13) at every LLM boundary — request params and parsed responses are models, never loose dicts.
pydantic-settingsfor config. - Serving/HTTP: FastAPI 0.139 +
httpx.AsyncClient. Async on every I/O path; never call a sync SDK method inside anasync defroute. - LLM access: provider SDKs directly —
openai2.44 (Responses API),anthropic(Messages API),cohere,voyageai— or LiteLLM as a unified gateway when you need multi-provider routing/fallback. Structured output via instructor 1.15 or the SDK's native.parse(). - Orchestration (only when warranted):
pydantic-ai2.x or LangGraph 1.x (langchain-core 1.4) for real agent graphs; LlamaIndex 0.14 for ingestion/retrieval helpers. Do not pull a framework to make one completion call. - Vector store: pgvector if you're already on Postgres; Qdrant 1.17+ for low-latency filtered RAG (upgrade one minor at a time — 1.17 dropped RocksDB for Gridstore). Chroma for local dev/prototypes only, not production scale.
- Embeddings:
voyage-4-large(current Voyage SOTA — MoE, Matryoshka 256–2048 dims),text-embedding-3-large, or Cohereembed-v4(multimodal). Reranker: Coherererank-v4.0(pro/fast) orvoyage rerank-2.5. Pin the embedding model per index — mixing model versions in one collection corrupts distances. - Evals: Ragas (retrieval + faithfulness metrics), DeepEval (CI gates), promptfoo (prompt sweeps + red-team). Tracing: Langfuse / Arize Phoenix, emitting OpenTelemetry GenAI spans.
- Model IDs: pin exact strings per provider and keep them in config. Current Claude IDs:
claude-opus-4-8(default),claude-sonnet-5,claude-haiku-4-5— never append a date suffix to a Claude alias. Use a strong model for generation, a cheap fast model for classification/routing.
Project conventions
- Layout:
src/app/ ingest/ # loaders, chunking, embedding, upsert retrieval/ # query, hybrid search, rerank, filters llm/ # clients, prompts, Pydantic schemas api/ # FastAPI routes evals/ # datasets + eval runners config.py # pydantic-settings Settings tests/ evals/datasets/ - Prompts live in versioned files (
llm/prompts/*.mdor module constants), each owned by one module and rendered with explicit named variables — not f-strings scattered across call sites. - All LLM input/output crosses the boundary as Pydantic models. No raw dicts, no
**kwargssoup. - Config via
pydantic-settingsloaded from env; a singleSettingsinstance; secrets only from env or a secret manager. - Absolute imports from
app.; ruffI(isort) enforced; line length 100;ruff formatis the only formatter. async deffor anything touching the network; the sync SDK clients are for scripts and tests only.
Prompting
- Separate roles cleanly: system = task, rules, the output contract, and the grounding policy; user = the query plus retrieved context. Never fold instructions into retrieved content.
- Ask for structured output via JSON schema / tool calling with
strict: true, then parse into Pydantic and validate. Neverre/split()/scrape free text.- OpenAI:
client.responses.parse(model=..., input=..., text_format=Answer)(prefer the Responses API over legacychat.completionsfor new code). - Anthropic:
client.messages.parse(..., output_config={"format": {"type": "json_schema", "schema": Answer.model_json_schema()}})— preferoutput_config.format/messages.parse(); the old top-leveloutput_formatparam is deprecated (still accepted by.parse(), but don't rely on it). - Cross-provider:
instructor.from_provider(...)withresponse_model=Answer.
- OpenAI:
- Design schemas for strict decoding: keep them flat, mark every object
additionalProperties: falsewith explicitrequired, and useenum/constfor closed sets. Recursion and numeric/length constraints (minimum,maxLength, …) are rejected or silently dropped by strict modes — enforce those in code after parsing. - Add few-shot examples only when they lift an eval — 2–5 of them, as prior
user/assistantturns, not pasted into the system prompt (keeps the cache prefix stable). - Keep the system prompt byte-stable: no
datetime.now(), UUIDs, or per-request IDs interpolated in — that silently kills prompt caching. Inject volatile context later in the message list. - Delimit each retrieved chunk with explicit markers and its ID (
<doc id="42">…</doc>) and instruct the model to cite the IDs it used. - Require an explicit escape hatch: instruct the model to answer only from provided context and emit
"insufficient_context": truewhen the answer isn't there — then branch on it in code.
Retrieval (RAG)
- Chunk on structure first (headings, code blocks, tables), then size: target 256–512 tokens with 10–20% overlap. Measure with the target model's tokenizer —
tiktokenfor OpenAI, the Anthropiccount_tokensendpoint for Claude. Neverlen(text.split()), and nevertiktokenfor Claude (it undercounts 15–20%). - Contextual retrieval: before embedding, prepend a 1–2 sentence, LLM-generated summary situating each chunk in its parent document (title, section, what it's about). It sharply cuts failed retrievals on chunks that are ambiguous out of context — cache the doc prefix so you pay the context-generation cost once per document, not once per chunk.
- Attach metadata to every chunk:
doc_id,source,section,chunk_index,embedding_model,tenant_id,created_at. Use a deterministic content hash as the point ID so re-ingest is an idempotent upsert, not a duplicate. - Embed in batches; store the embedding model + version next to the vectors so you can re-index when you change models. Use cosine distance; normalize if the store expects it.
- Rewrite the query before you retrieve when inputs are conversational or sparse: resolve pronouns/ellipsis against the chat history, and optionally fan out (multi-query or HyDE) and fuse the hits with RRF. Embedding the raw last user turn verbatim is a common recall killer.
- Retrieve wide, then narrow: fetch top 20–50 by hybrid search (dense + BM25/sparse fused with RRF — pgvector +
tsvector, or Qdrant native hybrid), then rerank to the top 3–8 with a cross-encoder (rerank-v4.0/voyage rerank-2.5). Dense-only top-k is a baseline, not the delivered context. - Enforce tenant/user isolation as a server-side metadata filter on every query. A client-supplied filter is never the source of truth.
- Ground and cite: pass only the reranked chunks; require citations by
doc_id; validate returned citations against the IDs actually retrieved and drop/flag any the model invented. - Allow "I don't know": if all rerank scores fall below a tuned threshold, or the model returns the insufficient-context sentinel, return a no-answer response instead of forcing generation.
Context management
- Compute an explicit token budget per request:
system + tools + few-shot + retrieved context + query + reserved output. Don't guess it. Tool definitions count too — they render before system/messages and consume input tokens (and cache prefix) whether or not the model calls them; prune tools a request can't use. - Fill by relevance (reranked order) and truncate the tail whole chunks — never a naive
context[:8000]that severs mid-chunk. Keep chunks intact; drop the least relevant. - A 1M window is not a license to stuff it. Cap retrieved context at a sane fraction of the window and stop at the rerank cut — extra chunks past it lower precision and raise cost and latency, and leave headroom for output plus thinking tokens.
- Dedupe near-identical chunks (same
doc_id+ overlap) before packing. - Order for "lost in the middle": models attend most to the start and end of a long context. Put the highest-reranked chunks at the head and tail, not buried in the center; don't assume a 1M window means position stops mattering.
- Multi-turn / agent memory: summarize or trim old turns rather than resending the full transcript; on Claude use server-side compaction/context editing; keep durable facts (user prefs, learned state) in a store, not glued into the prompt.
- Log final token counts (
input,output,cache_read) on every call.
Reliability
- Set explicit timeouts (connect + read) on every provider call; don't inherit defaults. For long or high-
max_tokensgenerations, stream (.stream()/ SSE) to avoid HTTP timeouts and to surface partial output. - Retry only retryable failures (429, 5xx, connection errors) with exponential backoff + jitter; honor the
Retry-Afterheader on 429. Use the SDK's built-inmax_retriesfor the simple case,tenacityfor a custom policy. Never retry a 4xx validation error. - Make writes idempotent: dedupe on a request hash / idempotency key so a retry doesn't double-charge or double-upsert.
- Cache two layers: an exact-match response cache (Redis, key = hash of model + params + rendered prompt) for repeated identical calls, and provider prompt caching (
cache_control: {"type": "ephemeral"}on Anthropic; automatic prefix caching on OpenAI) for large static prefixes. Verify hits viausage.cache_read_input_tokens. - A semantic (embedding-similarity) response cache trades correctness for hit rate — gate it behind a high similarity threshold and use it only for read-only, non-personalized queries. Never let it short-circuit a tool-triggering or user-specific answer.
- Track cost + latency on every call: model, input/output/cached tokens, latency, computed $ cost,
request_id→ structured logs + a tracing backend. You can't optimize what you don't measure. - Branch on
stop_reason, don't assumeend_turn: onmax_tokensthe output (and any JSON) is truncated — raise the cap or stream and retry, never parse the partial; onpause_turnresend to resume server-tool loops; onrefusalsurface it, don't retry the same prompt. - Bound every agent loop: cap tool-call iterations, sub-agent depth, and a cumulative token/cost budget per request; stop and return partial progress at the ceiling. An unbounded loop is an unbounded bill.
- Degrade gracefully: on overload, fall back to a cheaper/faster model or a cached response rather than failing the request outright.
Guardrails
- Treat all model output as untrusted input. Validate against a Pydantic schema before use; on validation failure, re-ask once or fail closed — never forward raw text downstream.
- Validate semantics, not just shape: a schema-valid response can still be out of range or unsafe. After parsing, bound-check values (enums, numeric ranges, allowlisted IDs/paths) before acting — a well-formed
{"action": "delete", "scope": "all"}is still a destructive command. - Never
eval()/exec()model output; never string-format it into SQL, shell, or HTML. Parameterize SQL; gate any model-triggerable tool behind an allowlist and least-privilege scopes; require human confirmation for destructive/irreversible actions. - Prompt injection: retrieved documents and tool results are attacker-controllable. Wrap them in explicit delimiters, tell the model they are data not instructions, and never let a retrieved doc escalate tool permissions.
- PII: redact/pseudonymize before logging and before writing to eval sets; check the provider's data-retention / ZDR posture before sending regulated data; filter PII from output before returning when policy requires.
- Never put secrets, API keys, or credentials in prompts, system messages, or few-shot examples — they get logged and cached.
Evals
- Keep a versioned golden dataset in
evals/datasets/:{question, ground_truth_answer, relevant_doc_ids}. Grow it from real production failures. - Measure retrieval and generation separately:
- Retrieval: recall@k, MRR/nDCG against
relevant_doc_ids; Ragascontext_precision/context_recall. - Generation: Ragas
faithfulness+answer_relevancy; correctness via an LLM-judge with an explicit rubric, not a bare 1–10.
- Retrieval: recall@k, MRR/nDCG against
- Gate CI on evals: run DeepEval
assert_test/ promptfoo over the golden set for any change to prompts, chunking, retrieval, or model; block the merge on regression against a baseline. - LLM-judge hygiene: pin the judge model ID and rubric version; prefer pairwise (A/B) over absolute 1–10 scoring, and randomize option order — judges have position and verbosity biases. Spot-check a slice against human labels so you trust the judge before trusting its verdicts.
- Version prompts and chunking params; attach eval scores to each version so you can bisect a regression.
- Use a held-out set the judge and prompts never train on. Report aggregate and per-slice scores. Do not ship on vibes.
Testing
- pytest + pytest-asyncio. Unit tests mock the provider (record with
respx/ VCR cassettes) — deterministic, no live keys, run on every commit. - Test the model-free invariants that must be exactly right: chunk boundaries, ID stability (same input → same point ID), token budgeting, citation validation, tenant-isolation filters, and Pydantic parsing of both well-formed and malformed model output.
- Keep the eval suite (which does call models) separate (
pytest -m evals/ nightly), out of the unit path. - Assert on structure and invariants, not exact generated wording. Snapshot-test chunking output against a fixed corpus.
- Contract-test the schemas: snapshot each response/tool JSON schema and diff on change — a silent schema edit breaks every downstream consumer and invalidates the structured-output compile cache.
Security
- Secrets from env / secret manager via
pydantic-settings; never commit keys; never log a prompt that contains one. - Multi-tenant isolation enforced server-side on every vector query and every write (filter by
tenant_id); add a test proving one user cannot retrieve another tenant's chunks. - Rate-limit and quota per user/API key; cap
max_tokensand the number of retrieved docs to bound cost and blast radius. - Sanitize model output before rendering (escape HTML/markdown) to prevent stored-XSS from injected content.
- Log
request_id, user, model, tokens, cost — with PII and secrets redacted. Honor deletion requests: purge a user's chunks and eval rows.
Do
- Define a Pydantic model for every LLM response and
model_validateit before use. - Retrieve top 20–50 → rerank → keep top 3–8; cite IDs; validate the citations.
- Use content-hash point IDs and store
embedding_modelso upserts are idempotent and reindexable. - Stream long generations; set explicit timeouts; back off on 429/5xx with
Retry-After. - Enforce tenant filters server-side; test the isolation.
- Re-run evals on any prompt/chunking/retrieval/model change and gate the merge.
- Track tokens, cost, and latency per call in structured logs + traces.
Avoid
- Regex/
json.loadson free-form text → use strict structured outputs (strict: true) + Pydantic.parse(). tiktokento budget Claude context → use the Anthropiccount_tokensendpoint (tiktoken undercounts).- Dense-only top-k as final context → hybrid retrieval + cross-encoder rerank.
datetime.now()/ UUIDs in the system prompt → breaks prompt caching; inject volatile context late.thinking: {type: "enabled", budget_tokens: N}on current Claude models → removed (returns 400); usethinking: {type: "adaptive"}+output_config.effort.- Anthropic
output_formatparam → useoutput_config={"format": {...}}/messages.parse(). - Hand-rolled retry loops that also retry 4xx →
tenacity/SDKmax_retrieson 429/5xx only. - Pulling LangChain for a single completion → call the SDK directly; reserve frameworks for real agent graphs.
- Storing vectors without the embedding model/version → you can't safely re-index later.
- Trusting a client-supplied metadata filter for isolation → enforce it server-side.
- Fixed-character chunking (
text[:1000]) → structure-aware, token-sized chunking with overlap. - Shipping prompt/retrieval changes without re-running evals → gate on the golden set.
When you code
- Keep diffs small and scoped to one concern (a prompt, a retriever, a schema).
- Before you finish: run
ruff check,ruff format, pyright, andpytest. If you touched prompts, chunking, retrieval, or the model, run the eval suite and report the delta. - Ask before: changing the embedding model (forces a full reindex), changing the chunking strategy (invalidates the index and evals), altering the vector schema, or switching providers.
- When adding a provider call, wire in timeout, retry policy, and cost/latency logging in the same change — not as a follow-up.
Drop it in your repo
Save these rules as AGENTS.md, CLAUDE.md, .cursorrules, .windsurfrules or .github/copilot-instructions.md — your agent instantly codes to the same standard on Python 3.14 · Pydantic v2 · pgvector/Qdrant · LiteLLM.