Data & AI · Python 3.13 · PyTorch 2.12 · torchvision 0.27 · CUDA 13
PyTorch / ML
Reproducible training loops, device-safe tensors, no silent NaNs.
Updated 5 Jul 2026 · CC0
AGENTS.md (repo root)
You are writing PyTorch training and modeling code for research and production. "Good" means reproducible, leak-free, device-agnostic, memory-efficient, and numerically correct: every run seeded and logged, every eval under `no_grad`/`eval()`, checkpoints that restore exactly, and no silent NaNs. Prefer plain, explicit torch over framework magic; reach for a trainer library only when it earns its keep.
Stack
- Python 3.13 (3.12–3.14 supported; pin one in `pyproject.toml`). torch requires 3.10+.
- torch 2.12.1: install CUDA/ROCm/XPU wheels from the official index, matched to your driver: `uv pip install torch==2.12.1 --index-url https://download.pytorch.org/whl/cu130` (CUDA 13). Never `pip install torch` from PyPI for GPU boxes.
- torchvision 0.27.1, torchaudio 2.12.1: versions are lockstep with torch; upgrade all three together.
- NumPy 2.5.x (2.x ABI; ensure C-extension deps are 2.x-built).
- safetensors for weight serialization/sharing (not pickle). einops for readable reshapes. torchmetrics for metrics.
- Config: Hydra + OmegaConf or pydantic-settings. Tracking: Weights & Biases, MLflow, or TensorBoard (`torch.utils.tensorboard`). Data/model versioning: DVC or W&B Artifacts.
- Tooling: uv 0.11 (env + lockfile), ruff 0.15 (lint + format), pyright or mypy (type check), pytest 9 (tests).
- AMP: `torch.amp.autocast("cuda")` and `torch.amp.GradScaler("cuda")`. The `torch.cuda.amp.*` forms are deprecated (since 2.4); do not use them.
- Distributed: `torchrun` + `DistributedDataParallel`; FSDP2 (`torch.distributed.fsdp.fully_shard`) for sharded large-model training. `DataParallel` is legacy; never use it.
Project conventions
    src/<pkg>/
      data/      datasets.py, transforms.py, datamodule.py
      models/    layers.py, <arch>.py  # each nn.Module in its own file
      train.py   # entrypoint: build -> fit -> checkpoint
      eval.py
      engine.py  # train_one_epoch / evaluate loops
      utils/     seed.py, distributed.py, logging.py, checkpoint.py
    configs/     *.yaml (Hydra)
    tests/
    pyproject.toml
- One `nn.Module` per file; `forward` reads top-to-bottom with no hidden global state.
- Absolute imports (`from src.models.resnet import ResNet`); no `import *`.
- Ruff formats and lints (`ruff format`, `ruff check --fix`); line length 100. Enable rule sets `E`, `F`, `I`, `UP`, `B`, `NPY`, `PTH`, `SIM`.
- Full type hints on public functions. Annotate tensor semantics in the signature or docstring (`x: Tensor  # (B, C, H, W)`); document dtype/device expectations.
- No compute at import time; guard entrypoints with `if __name__ == "__main__":`.
Reproducibility
- Seed everything from one function, once, before any RNG use:

      import random
      import numpy as np
      import torch

      def seed_everything(seed: int) -> None:
          random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
          torch.cuda.manual_seed_all(seed)

- For strict determinism: `torch.use_deterministic_algorithms(True)`, `torch.backends.cudnn.benchmark = False`, and set env `CUBLAS_WORKSPACE_CONFIG=:4096:8`. Accept the speed cost; keep `benchmark = True` only when you explicitly want throughput over bit-exactness.
- Seed DataLoader workers so shuffles/augmentations are reproducible: pass a `generator=torch.Generator().manual_seed(seed)` and a `worker_init_fn` that seeds `numpy`/`random` per worker from `torch.initial_seed()`.
- Log the full resolved config, all hyperparameters, git commit SHA, `torch.__version__`, CUDA version, and GPU name into the run tracker at startup. A checkpoint without its config is unreproducible.
- Version the dataset (hash or DVC/artifact ref) alongside code. Record the exact data split.
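
The DataLoader seeding rule above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `seed_worker` and `make_train_loader` are hypothetical helper names, and the toy `TensorDataset` stands in for a real dataset. It runs with `num_workers=0` here so it stays hermetic; `worker_init_fn` only fires when `num_workers > 0`.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    # Inside a worker process, torch.initial_seed() is already distinct per worker,
    # so deriving the numpy/random seeds from it keeps augmentations reproducible.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_train_loader(dataset, seed: int, batch_size: int = 4, num_workers: int = 0) -> DataLoader:
    # A dedicated generator pins the shuffle order independently of the global RNG.
    generator = torch.Generator().manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
    )


# Toy data (assumption): 16 integer samples, so the batch order is easy to inspect.
dataset = TensorDataset(torch.arange(16))
order_a = [batch[0].tolist() for batch in make_train_loader(dataset, seed=0)]
order_b = [batch[0].tolist() for batch in make_train_loader(dataset, seed=0)]
```

Two loaders built from the same seed yield the same batch order, which is the property a determinism test can assert on.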
Data
- Subclass `torch.utils.data.Dataset` (`__len__`, `__getitem__`) or an `IterableDataset` for streaming. Do I/O and decode in `__getitem__`; keep tensors CPU-side there.
- DataLoader for GPU training: `num_workers=min(8, os.cpu_count() or 1)` (`os.cpu_count()` can return `None`), `pin_memory=True`, `persistent_workers=True`, `prefetch_factor=2`, `drop_last=True` for train. Transfer with `x.to(device, non_blocking=True)` (pairs with `pin_memory`).
- No leakage. Split into train/val/test before fitting anything. Fit scalers, vocab, class weights, PCA, and normalization stats on train only, then apply frozen to val/test. For grouped data (patient, user, session), split by group so no group spans two sets.
- Normalize with train statistics only. Store `mean`/`std` in the checkpoint and reuse at inference; recomputing on eval data is leakage.
- `shuffle=True` for train, `False` for val/test. Never shuffle before a temporal split.
- Augment train only; val/test get deterministic preprocessing. Use `torchvision.transforms.v2` (the v1 `transforms` API is legacy).
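
As one way to picture the leakage rules above, the sketch below splits by group and fits normalization statistics on the training rows only. The helper names (`split_by_group`, `fit_normalizer`, `normalize`) and the synthetic data (100 rows, 10 groups) are assumptions made for illustration.

```python
import torch
from torch import Tensor


def split_by_group(group_ids: Tensor, val_fraction: float, seed: int) -> tuple[Tensor, Tensor]:
    """Return boolean (train_mask, val_mask) so that no group spans both sets."""
    groups = torch.unique(group_ids)
    generator = torch.Generator().manual_seed(seed)
    shuffled = groups[torch.randperm(len(groups), generator=generator)]
    n_val = max(1, int(len(groups) * val_fraction))
    val_mask = torch.isin(group_ids, shuffled[:n_val])
    return ~val_mask, val_mask


def fit_normalizer(train_x: Tensor) -> dict[str, Tensor]:
    # Per-feature statistics computed from training rows only.
    return {"mean": train_x.mean(dim=0), "std": train_x.std(dim=0).clamp_min(1e-8)}


def normalize(x: Tensor, stats: dict[str, Tensor]) -> Tensor:
    # The same frozen stats are applied to train, val, test, and inference inputs.
    return (x - stats["mean"]) / stats["std"]


# Synthetic stand-in data: 100 rows, 3 features, 10 groups of 10 rows each.
x = torch.randn(100, 3, generator=torch.Generator().manual_seed(0))
group_ids = torch.arange(100) % 10
train_mask, val_mask = split_by_group(group_ids, val_fraction=0.2, seed=0)
stats = fit_normalizer(x[train_mask])
train_x, val_x = normalize(x[train_mask], stats), normalize(x[val_mask], stats)
```

The `stats` dict is what would be stored in the checkpoint and reused at inference.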
Model
- Subclass `nn.Module`; register submodules/parameters as attributes (or `nn.ModuleList`/`ModuleDict`; a plain Python list hides params from `.parameters()` and `.to()`).
- Resolve one `device` centrally and be device-agnostic:

      device = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else torch.device("cpu")

  Never hardcode `.cuda()` or `"cuda:0"`. Move model and every input tensor to the same device; `model.to(device)` mutates in place, but `tensor.to(device)` returns a copy, so reassign it.
- Init explicitly (`nn.init.kaiming_normal_`, etc.); do not rely on defaults for custom layers. Use `nn.LayerNorm`/`nn.BatchNorm2d` correctly (BN needs `eval()` to freeze running stats).
- `forward` returns raw logits; keep the loss (`nn.CrossEntropyLoss`, which applies log-softmax internally, so do not softmax first) and the final activation out of the model.
- `torch.set_float32_matmul_precision("high")` to enable TF32 matmuls on Ampere+ when full fp32 precision isn't required.
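
The submodule-registration point above is easy to verify directly. The sketch below uses two hypothetical toy modules, `HiddenParams` and `RegisteredParams`, that differ only in how they hold their layers.

```python
import torch
from torch import nn


class HiddenParams(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Plain Python list: these layers are NOT registered as submodules,
        # so .parameters(), .to(), and state_dict() will not see them.
        self.layers = [nn.Linear(4, 4) for _ in range(2)]


class RegisteredParams(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # nn.ModuleList registers each layer as a submodule.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x


hidden_count = sum(p.numel() for p in HiddenParams().parameters())
registered_count = sum(p.numel() for p in RegisteredParams().parameters())
```

With the plain list the optimizer would receive no parameters at all, so the model silently never trains.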
Training loop
- Canonical step:

      model.train()
      for x, y in train_loader:
          x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
          optimizer.zero_grad(set_to_none=True)
          with torch.amp.autocast("cuda", dtype=torch.bfloat16):  # bf16: no GradScaler
              loss = criterion(model(x), y)
          loss.backward()
          torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
          optimizer.step()

- AMP: prefer `dtype=torch.bfloat16` on Ampere+. It has fp32 range, so no `GradScaler`: use the plain `loss.backward()`/`optimizer.step()` above. Only on older GPUs needing `torch.float16` do you add a scaler, and then the flow is `scaler.scale(loss).backward()` → `scaler.unscale_(optimizer)` (before clipping) → `scaler.step(optimizer)` → `scaler.update()`, plus `scaler.state_dict()` in the checkpoint. Don't mix bf16 with a scaler.
- Order is always `zero_grad -> forward -> backward -> (clip) -> step`. Set `set_to_none=True` (the default since 2.0) for a small speed/memory win.
- Step the LR scheduler once per epoch or per step per its contract; call it after `optimizer.step()`, never before.
- Eval every N steps/epochs under both guards:

      model.eval()
      with torch.inference_mode():
          for x, y in val_loader:
              ...  # inference_mode > no_grad: also skips view tracking and version-counter bumps

  Return to `model.train()` afterward. Forgetting `eval()` leaves dropout/BN in train mode; forgetting `inference_mode`/`no_grad` builds autograd graphs that hold activations in memory and can OOM.
- Track train and val loss/metrics every epoch; a train curve alone hides overfitting. Log LR too.
- Checkpoint the best val metric, not the last epoch. Save a dict: `model.state_dict()`, `optimizer.state_dict()`, `scheduler.state_dict()`, `scaler.state_dict()` (if using fp16 AMP), `epoch`, `best_metric`, and the config, so training resumes exactly.
- Early stop on a patience counter over the val metric; restore best weights before final eval/export.
- Guard NaNs/Inf: if `torch.isfinite(loss)` is false, log inputs/LR and stop rather than poisoning weights. Exploding grads -> lower LR, add `clip_grad_norm_`, check init and normalization.
- Minimize CPU-GPU syncs: accumulate metrics on-device and `.item()`/`.cpu()` once per logging interval, not per batch. Every `.item()`, `print(tensor)`, or `if loss > x:` on a GPU tensor forces a blocking sync.
- `torch.compile(model)` (mode `"default"` or `"max-autotune"`) for real speedups; compile once outside the loop. Keep input shapes stable to avoid recompiles; use `torch._dynamo.error_on_graph_break()` to catch unintended graph breaks in hot regions.
- Gradient accumulation for large effective batch: divide loss by `accum_steps`, step every `accum_steps` iterations.
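
The gradient-accumulation and NaN-guard rules above might combine as in the following sketch. It is CPU-only and omits autocast for brevity. `train_with_accumulation` is a hypothetical helper, and the tiny linear model and random batches are placeholders. Note that the `isfinite` check on a GPU tensor is itself a sync, so a real loop might run it only every few steps.

```python
import torch
from torch import nn


def train_with_accumulation(model, loader, optimizer, criterion, accum_steps: int, max_norm: float = 1.0) -> int:
    """Run one pass over `loader`; return the number of optimizer steps taken."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    steps = 0
    for i, (x, y) in enumerate(loader, start=1):
        loss = criterion(model(x), y)
        if not torch.isfinite(loss):
            # Stop before a NaN/Inf gradient can poison the weights.
            raise FloatingPointError(f"non-finite loss at micro-batch {i}")
        # Scale so the accumulated gradient matches one large-batch average.
        (loss / accum_steps).backward()
        if i % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            steps += 1
    # Trailing micro-batches that do not fill a full window are not stepped here.
    return steps


torch.manual_seed(0)
model = nn.Linear(4, 2)
batches = [(torch.randn(2, 4), torch.randint(0, 2, (2,))) for _ in range(8)]
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
weight_before = model.weight.detach().clone()
steps_taken = train_with_accumulation(model, batches, optimizer, nn.CrossEntropyLoss(), accum_steps=4)
```

Eight micro-batches with `accum_steps=4` give two optimizer steps, each on an effective batch of eight samples.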
Testing
- pytest. Run on CPU with tiny tensors so tests are fast and hermetic.
- Shape/dtype contracts: feed a dummy `(B, ...)` batch, assert output shape and `dtype`. Use `torch.testing.assert_close` (not `==`) for float comparisons.
- Overfit one batch: train on a single batch for ~100 steps and assert loss -> ~0. The fastest way to catch a broken forward/loss/backward wiring.
- Determinism: same seed -> identical loss/output; assert with `assert_close`.
- Gradient flow: after `backward`, assert key parameters have non-None, finite, non-zero `.grad`; catch layers accidentally detached or frozen.
- No-leak / eval-mode: assert `model.eval()` changes BN/dropout behavior; assert normalization stats come from train config, not the batch.
- Use `torch.autograd.gradcheck` on custom `autograd.Function`s.
- Keep a smoke test that runs one full train+val step end-to-end in CI.
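
A few of the checks listed above, written as pytest-style functions. This is a sketch under assumptions: `make_model` is a hypothetical factory returning a small MLP in place of the real model, and the step count and loss threshold in the overfit test are illustrative values, not tuned ones.

```python
import torch
from torch import nn


def make_model() -> nn.Module:
    # Hypothetical stand-in for the model under test.
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))


def test_output_shape_and_dtype() -> None:
    out = make_model()(torch.randn(5, 8))
    assert out.shape == (5, 3)
    assert out.dtype == torch.float32


def test_same_seed_same_output() -> None:
    outputs = []
    for _ in range(2):
        torch.manual_seed(0)  # reseed before both init and input sampling
        outputs.append(make_model()(torch.randn(5, 8)))
    torch.testing.assert_close(outputs[0], outputs[1], rtol=0, atol=0)


def test_overfits_one_batch() -> None:
    torch.manual_seed(0)
    model = make_model()
    x, y = torch.randn(4, 8), torch.tensor([0, 1, 2, 0])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()
    for _ in range(300):
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    assert loss.item() < 0.05


def test_gradients_flow() -> None:
    torch.manual_seed(0)
    model = make_model()
    loss = nn.functional.cross_entropy(model(torch.randn(4, 8)), torch.tensor([0, 1, 2, 0]))
    loss.backward()
    for name, param in model.named_parameters():
        assert param.grad is not None, name
        assert torch.isfinite(param.grad).all(), name
        assert param.grad.abs().sum() > 0, name
```

All four run on CPU with tiny tensors, so they fit the fast, hermetic profile described in the list.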
Security
- `torch.load` defaults to `weights_only=True` (since 2.6); keep it. Never pass `weights_only=False` on a checkpoint you didn't produce: legacy pickle loading is arbitrary code execution. Prefer safetensors (`save_file`/`load_file`) for any weights you share or download.
- `torch.hub.load(..., trust_repo=...)` and loading arbitrary repos run remote code; pin a commit and review it; treat as untrusted.
- Pin every dependency with a `uv.lock`; scan for CVEs. Match torch wheels to the CUDA/driver you actually run.
- Don't log secrets, raw PII, or full dataset rows into experiment trackers.
- Validate inference input shape/dtype/range at the API boundary before it reaches the model.
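
For the input-validation rule, a boundary check could look roughly like this. The function name `validate_image_batch`, the `(B, C, H, W)` float32 contract, and the `[0, 1]` value range are all assumptions chosen for illustration; a real service would substitute its own contract.

```python
import torch
from torch import Tensor


def validate_image_batch(x: Tensor, channels: int = 3, size: int = 224) -> Tensor:
    """Reject malformed inference input before it reaches the model."""
    if not isinstance(x, Tensor):
        raise TypeError(f"expected a Tensor, got {type(x).__name__}")
    if x.ndim != 4 or x.shape[0] == 0 or tuple(x.shape[1:]) != (channels, size, size):
        raise ValueError(f"expected (B, {channels}, {size}, {size}) with B >= 1, got {tuple(x.shape)}")
    if x.dtype != torch.float32:
        raise TypeError(f"expected float32, got {x.dtype}")
    if not torch.isfinite(x).all():
        raise ValueError("input contains NaN or Inf")
    # Assumed contract: pixel values already scaled to [0, 1].
    if x.min() < 0.0 or x.max() > 1.0:
        raise ValueError("expected values in [0, 1]")
    return x


checked = validate_image_batch(torch.rand(2, 3, 8, 8), size=8)
```

Raising at the boundary keeps shape and dtype errors out of the model's forward pass, where they tend to surface as less readable failures.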
Do
- Move model and tensors to the same resolved `device`; keep code CPU/GPU/MPS-agnostic via `torch.accelerator`.
- Seed all RNGs and log config + git SHA + library versions at startup.
- Fit all preprocessing on train only; persist stats in the checkpoint.
- Use `torch.amp.autocast` + bf16, `torch.compile`, `pin_memory`, `non_blocking=True`, and `channels_last` for CNNs.
- Save/restore `state_dict`s (model + optimizer + scheduler + scaler + epoch) and checkpoint the best val metric.
- Wrap all eval in `model.eval()` + `torch.inference_mode()`.
- Clip grad norm; assert `torch.isfinite(loss)` each step.
- Use `torchrun` + DDP (one process per GPU) for multi-GPU; FSDP2 when the model doesn't fit.
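
The save/restore item above can be illustrated with a round trip through an in-memory buffer. `save_checkpoint` and `load_checkpoint` are hypothetical helpers, the dict keys are one possible layout, and the scaler entry is left out because this sketch does not use fp16 AMP.

```python
import io
from typing import Any

import torch
from torch import nn


def save_checkpoint(target, model, optimizer, scheduler, epoch: int, best_metric: float, config: dict[str, Any]) -> None:
    # Only state_dicts and plain Python values, so weights_only loading works.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "epoch": epoch,
            "best_metric": best_metric,
            "config": config,
        },
        target,
    )


def load_checkpoint(source, model, optimizer, scheduler) -> dict[str, Any]:
    state = torch.load(source, map_location="cpu", weights_only=True)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state


def build() -> tuple[nn.Module, torch.optim.Optimizer, torch.optim.lr_scheduler.StepLR]:
    # Placeholder model/optimizer/scheduler for the round trip.
    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    return model, optimizer, scheduler


model, optimizer, scheduler = build()
model(torch.randn(3, 4)).sum().backward()
optimizer.step()   # populates momentum buffers
scheduler.step()   # halves the learning rate

buffer = io.BytesIO()
save_checkpoint(buffer, model, optimizer, scheduler, epoch=3, best_metric=0.9, config={"lr": 0.1})
buffer.seek(0)

restored_model, restored_optimizer, restored_scheduler = build()
state = load_checkpoint(buffer, restored_model, restored_optimizer, restored_scheduler)
```

In practice `target` would be a file path; the buffer only keeps the example free of disk I/O.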
Avoid
- `torch.save(model)`/`torch.load` of a whole model -> save `model.state_dict()`; whole-model pickle breaks on refactor and is unsafe.
- `torch.cuda.amp.autocast`/`torch.cuda.amp.GradScaler` (deprecated 2.4) -> `torch.amp.autocast("cuda")`/`torch.amp.GradScaler("cuda")`.
- Hardcoded `.cuda()`/`"cuda:0"` -> resolved `device` variable + `.to(device)`.
- Eval without `eval()` and `inference_mode()` -> dropout/BN misbehave and memory leaks; always both.
- `nn.DataParallel` -> `DistributedDataParallel` via `torchrun`.
- `transforms` v1 -> `torchvision.transforms.v2`.
- Fitting scaler/vocab/PCA on the full dataset, shuffling before a temporal split, or normalizing with val/test stats -> leakage; train-only stats.
- `.item()`/`print(tensor)`/Python-`if` on GPU tensors inside the loop -> forces sync; aggregate on-device, sync once per log interval.
- `weights_only=False` on untrusted checkpoints -> RCE; keep the default or use safetensors.
- `optimizer.step()` before `loss.backward()`, or reusing stale grads by skipping `zero_grad` -> wrong updates.
- `total_loss += loss` (graph-retaining) for logging -> use `loss.detach()`/`.item()`.
- `softmax` then `CrossEntropyLoss` -> double softmax; feed raw logits.
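
Two of the pitfalls above are quick to demonstrate on a toy example. The logits and target here are made-up values chosen so the effect is visible.

```python
import torch
from torch import nn

# One confident, correct prediction over three classes (illustrative values).
logits = torch.tensor([[4.0, 0.0, 0.0]], requires_grad=True)
target = torch.tensor([0])
criterion = nn.CrossEntropyLoss()

# Correct: CrossEntropyLoss applies log-softmax itself.
loss_correct = criterion(logits, target)

# Pitfall: softmax first squashes logits into [0, 1], so the loss stays
# large even for a confident, correct prediction.
loss_double = criterion(torch.softmax(logits, dim=1), target)

# Logging pitfall: adding the raw loss keeps the autograd graph alive.
graph_retaining_total = torch.zeros(()) + loss_correct
# Safer: detach before accumulating.
running_loss = torch.zeros(())
running_loss += loss_correct.detach()
```

Because softmax outputs differ by at most 1, the double-softmax loss for three classes cannot fall below about 0.55, so training appears to plateau no matter how confident the model becomes.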
When you code
- Make small, reviewable diffs — one concern per change (data, model, loop, config). Don't refactor the training loop and the architecture in the same PR.
- Before proposing a change, run `ruff format`, `ruff check`, the type checker, and `pytest`. State what you ran.
- After any change to the model or loop, run the overfit-one-batch sanity check and report the loss trend before claiming it trains.
- When you touch reproducibility-sensitive code (seeding, splits, normalization), spell out the leakage/determinism reasoning in the PR description.
- Never silently swap optimizer, LR, batch size, precision, or seed — these change results. Surface the change and its rationale.
- Ask before: adding a heavy dependency or trainer framework; changing the data split or normalization scheme; enabling non-deterministic kernels in a run that must be reproducible; downloading weights/data from an unpinned remote source.
Drop it in your repo
Save these rules as AGENTS.md, CLAUDE.md, .cursorrules, .windsurfrules or .github/copilot-instructions.md — your agent instantly codes to the same standard on Python 3.13 · PyTorch 2.12 · torchvision 0.27 · CUDA 13.