PyTorch / ML — AI agent rules (AGENTS.md, CLAUDE.md, .cursorrules)

You are writing PyTorch training and modeling code for research and production. "Good" means reproducible, leak-free, device-agnostic, memory-efficient, and numerically correct — every run seeded and logged, every eval under no_grad/eval(), checkpoints that restore exactly, and no silent NaNs. Prefer plain, explicit torch over framework magic; reach for a trainer library only when it earns its keep.

Stack

Python 3.13 (3.12–3.14 supported; pin one in pyproject.toml). torch requires 3.10+.
torch 2.12.1 — install CUDA/ROCm/XPU wheels from the official index, matched to your driver: uv pip install torch==2.12.1 --index-url https://download.pytorch.org/whl/cu130 (CUDA 13). Never pip install torch from PyPI for GPU boxes.
torchvision 0.27.1, torchaudio 2.12.1 — versions are lockstep with torch; upgrade all three together.
NumPy 2.5.x (2.x ABI; ensure C-extension deps are 2.x-built).
safetensors for weight serialization/sharing (not pickle). einops for readable reshapes. torchmetrics for metrics.
Config: Hydra + OmegaConf or pydantic-settings. Tracking: Weights & Biases, MLflow, or TensorBoard (torch.utils.tensorboard). Data/model versioning: DVC or W&B Artifacts.
Tooling: uv 0.11 (env + lockfile), ruff 0.15 (lint + format), pyright or mypy (type check), pytest 9 (tests).
AMP: torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda"). The torch.cuda.amp.* forms are deprecated (since 2.4) — do not use them.
Distributed: torchrun + DistributedDataParallel; FSDP2 (torch.distributed.fsdp.fully_shard) for sharded large-model training. DataParallel is legacy — never use it.

Project conventions

src/<pkg>/
  data/        datasets.py, transforms.py, datamodule.py
  models/      layers.py, <arch>.py         # each nn.Module in its own file
  train.py     # entrypoint: build -> fit -> checkpoint
  eval.py
  engine.py    # train_one_epoch / evaluate loops
  utils/       seed.py, distributed.py, logging.py, checkpoint.py
configs/       *.yaml (Hydra)
tests/
pyproject.toml

One nn.Module per file; forward reads top-to-bottom with no hidden global state.
Absolute imports (from src.models.resnet import ResNet); no import *.
Ruff formats and lints (ruff format, ruff check --fix); line length 100. Enable rule sets E,F,I,UP,B,NPY,PTH,SIM.
Full type hints on public functions. Annotate tensor semantics in the signature or docstring (x: Tensor # (B, C, H, W)); document dtype/device expectations.
No compute at import time; guard entrypoints with if __name__ == "__main__":.

Reproducibility

Seed everything from one function, once, before any RNG use:

def seed_everything(seed: int) -> None:
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

For strict determinism: torch.use_deterministic_algorithms(True), torch.backends.cudnn.benchmark = False, and set env CUBLAS_WORKSPACE_CONFIG=:4096:8. Accept the speed cost; keep benchmark = True only when you explicitly want throughput over bit-exactness.
Seed DataLoader workers so shuffles/augmentations are reproducible: pass a generator=torch.Generator().manual_seed(seed) and a worker_init_fn that seeds numpy/random per worker from torch.initial_seed().
Log the full resolved config, all hyperparameters, git commit SHA, torch.__version__, CUDA version, and GPU name into the run tracker at startup. A checkpoint without its config is unreproducible.
Version the dataset (hash or DVC/artifact ref) alongside code. Record the exact data split.

Data

Subclass torch.utils.data.Dataset (__len__, __getitem__) or an IterableDataset for streaming. Do I/O and decode in __getitem__; keep tensors CPU-side there.
DataLoader for GPU training: num_workers=min(8, os.cpu_count()), pin_memory=True, persistent_workers=True, prefetch_factor=2, drop_last=True for train. Transfer with x.to(device, non_blocking=True) (pairs with pin_memory).
No leakage. Split into train/val/test before fitting anything. Fit scalers, vocab, class weights, PCA, and normalization stats on train only, then apply frozen to val/test. For grouped data (patient, user, session), split by group so no group spans two sets.
Normalize with train statistics only. Store mean/std in the checkpoint and reuse at inference — recomputing on eval data is leakage.
shuffle=True for train, False for val/test. Never shuffle before a temporal split.
Augment train only; val/test get deterministic preprocessing. Use torchvision.transforms.v2 (the v1 transforms API is legacy).

Model

Subclass nn.Module; register submodules/parameters as attributes (or nn.ModuleList/ModuleDict — a plain Python list hides params from .parameters() and .to()).
Resolve one device centrally and be device-agnostic:

device = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else torch.device("cpu")

Never hardcode .cuda() or "cuda:0". Move model and every input tensor to the same device; model.to(device) mutates in place, but tensor.to(device) returns a copy — reassign it.

Init explicitly (nn.init.kaiming_normal_, etc.); do not rely on defaults for custom layers. Use nn.LayerNorm/nn.BatchNorm2d correctly (BN needs eval() to freeze running stats).
forward returns raw logits; keep loss (nn.CrossEntropyLoss, which applies log-softmax internally — do not softmax first) and activation out of the model.
set_float32_matmul_precision("high") to enable TF32 matmuls on Ampere+ when full fp32 precision isn't required.

Training loop

Canonical step:

model.train()
for x, y in train_loader:
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):  # bf16: no GradScaler
        loss = criterion(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

AMP: prefer dtype=torch.bfloat16 on Ampere+ — it has fp32 range, so no GradScaler: use the plain loss.backward()/optimizer.step() above. Only on older GPUs needing torch.float16 do you add a scaler, and then the flow is scaler.scale(loss).backward() → scaler.unscale_(optimizer) (before clipping) → scaler.step(optimizer) → scaler.update(), plus scaler.state_dict() in the checkpoint. Don't mix bf16 with a scaler.
Order is always zero_grad -> forward -> backward -> (clip) -> step. Set set_to_none=True (the default since 2.0) for a small speed/memory win.
Step the LR scheduler once per epoch or per step per its contract; call it after optimizer.step(), never before.
Eval every N steps/epochs under both guards:

model.eval()
with torch.inference_mode():
    for x, y in val_loader:
        ...  # inference_mode > no_grad: also disables autograd version counters

Return to model.train() afterward. Forgetting eval() leaves dropout/BN in train mode; forgetting inference_mode/no_grad leaks graph memory and can OOM.

Track train and val loss/metrics every epoch; a train curve alone hides overfitting. Log LR too.
Checkpoint the best val metric, not the last epoch. Save a dict: model.state_dict(), optimizer.state_dict(), scheduler.state_dict(), scaler.state_dict() (if using fp16 AMP), epoch, best_metric, and the config — so training resumes exactly.
Early stop on a patience counter over the val metric; restore best weights before final eval/export.
Guard NaNs/Inf: if torch.isfinite(loss) is false, log inputs/LR and stop rather than poisoning weights. Exploding grads -> lower LR, add clip_grad_norm_, check init and normalization.
Minimize CPU-GPU syncs: accumulate metrics on-device and .item()/.cpu() once per logging interval, not per batch. Every .item(), print(tensor), or if loss > x: on a GPU tensor forces a blocking sync.
torch.compile(model) (mode "default" or "max-autotune") for real speedups; compile once outside the loop. Keep input shapes stable to avoid recompiles; use torch._dynamo.error_on_graph_break() to catch unintended graph breaks in hot regions.
Gradient accumulation for large effective batch: divide loss by accum_steps, step every accum_steps iterations.

Testing

pytest. Run on CPU with tiny tensors so tests are fast and hermetic.
Shape/dtype contracts: feed a dummy (B, ...) batch, assert output shape and dtype. Use torch.testing.assert_close (not ==) for float comparisons.
Overfit one batch: train on a single batch for ~100 steps and assert loss -> ~0. The fastest way to catch a broken forward/loss/backward wiring.
Determinism: same seed -> identical loss/output; assert with assert_close.
Gradient flow: after backward, assert key parameters have non-None, finite, non-zero .grad; catch layers accidentally detached or frozen.
No-leak / eval-mode: assert model.eval() changes BN/dropout behavior; assert normalization stats come from train config, not the batch.
Use torch.autograd.gradcheck on custom autograd.Functions.
Keep a smoke test that runs one full train+val step end-to-end in CI.

Security

torch.load defaults to weights_only=True (since 2.6) — keep it. Never pass weights_only=False on a checkpoint you didn't produce: legacy pickle loading is arbitrary code execution. Prefer safetensors (save_file/load_file) for any weights you share or download.
torch.hub.load(..., trust_repo=...) and loading arbitrary repos run remote code — pin a commit and review it; treat as untrusted.
Pin every dependency with a uv.lock; scan for CVEs. Match torch wheels to the CUDA/driver you actually run.
Don't log secrets, raw PII, or full dataset rows into experiment trackers.
Validate inference input shape/dtype/range at the API boundary before it reaches the model.

Do

Move model and tensors to the same resolved device; keep code CPU/GPU/MPS-agnostic via torch.accelerator.
Seed all RNGs and log config + git SHA + library versions at startup.
Fit all preprocessing on train only; persist stats in the checkpoint.
Use torch.amp.autocast + bf16, torch.compile, pin_memory, non_blocking=True, and channels_last for CNNs.
Save/restore state_dicts (model + optimizer + scheduler + scaler + epoch) and checkpoint the best val metric.
Wrap all eval in model.eval() + torch.inference_mode().
Clip grad norm; assert torch.isfinite(loss) each step.
Use torchrun + DDP (one process per GPU) for multi-GPU; FSDP2 when the model doesn't fit.

Avoid

torch.save(model) / torch.load of a whole model -> save model.state_dict(); whole-model pickle breaks on refactor and is unsafe.
torch.cuda.amp.autocast / torch.cuda.amp.GradScaler (deprecated 2.4) -> torch.amp.autocast("cuda") / torch.amp.GradScaler("cuda").
Hardcoded .cuda() / "cuda:0" -> resolved device variable + .to(device).
Eval without eval() and inference_mode() -> dropout/BN misbehave and memory leaks; always both.
nn.DataParallel -> DistributedDataParallel via torchrun.
transforms v1 -> torchvision.transforms.v2.
Fitting scaler/vocab/PCA on the full dataset, shuffling before a temporal split, or normalizing with val/test stats -> leakage; train-only stats.
.item() / print(tensor) / Python-if on GPU tensors inside the loop -> forces sync; aggregate on-device, sync once per log interval.
weights_only=False on untrusted checkpoints -> RCE; keep the default or use safetensors.
optimizer.step() before loss.backward(), or reusing stale grads by skipping zero_grad -> wrong updates.
total_loss += loss (graph-retaining) for logging -> use loss.detach() / .item().
softmax then CrossEntropyLoss -> double softmax; feed raw logits.

When you code

Make small, reviewable diffs — one concern per change (data, model, loop, config). Don't refactor the training loop and the architecture in the same PR.
Before proposing a change, run ruff format, ruff check, the type checker, and pytest. State what you ran.
After any change to the model or loop, run the overfit-one-batch sanity check and report the loss trend before claiming it trains.
When you touch reproducibility-sensitive code (seeding, splits, normalization), spell out the leakage/determinism reasoning in the PR description.
Never silently swap optimizer, LR, batch size, precision, or seed — these change results. Surface the change and its rationale.
Ask before: adding a heavy dependency or trainer framework; changing the data split or normalization scheme; enabling non-deterministic kernels in a run that must be reproducible; downloading weights/data from an unpinned remote source.