
pytorch

by benchflow-ai (GitHub)

Building and training neural networks with PyTorch. Use when implementing deep learning models, training loops, data pipelines, model optimization with torch.compile, distributed training, or deploying PyTorch models.


Compatible Agents

  • Claude Code: ~/.claude/skills/
  • Codex CLI: ~/.codex/skills/
  • Gemini CLI: ~/.gemini/skills/
  • OpenCode: ~/.opencode/skills/
  • OpenClaw: ~/.openclaw/skills/
  • GitHub Copilot: ~/.copilot/skills/
  • Cursor: ~/.cursor/skills/
  • Windsurf: ~/.codeium/windsurf/skills/
  • Cline: ~/.cline/skills/
  • Roo Code: ~/.roo/skills/
  • Kiro: ~/.kiro/skills/
  • Junie: ~/.junie/skills/
  • Augment Code: ~/.augment/skills/
  • Warp: ~/.warp/skills/
  • Goose: ~/.config/goose/skills/
SKILL.md

Train vs Eval Mode

  • model.train() enables dropout, BatchNorm updates β€” default after init
  • model.eval() disables dropout, uses running stats β€” MUST call for inference
  • Mode is sticky β€” train/eval persists until explicitly changed
  • model.eval() doesn't disable gradients β€” still need torch.no_grad()
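A minimal sketch of these rules, using a hypothetical two-layer model with dropout (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Dropout makes the train/eval difference observable
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))

# Modules start in training mode after construction
assert model.training

model.eval()            # mode is sticky: stays eval until train() is called
assert not model.training

# eval() alone does not stop autograd; wrap inference in no_grad()
x = torch.randn(2, 4)
with torch.no_grad():
    out = model(x)
assert not out.requires_grad
```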

Gradient Control

  • torch.no_grad() for inference β€” reduces memory, speeds up computation
  • loss.backward() accumulates gradients β€” call optimizer.zero_grad() before backward
  • zero_grad() placement matters β€” before forward pass, not after backward
  • .detach() to stop gradient flow β€” prevents memory leak in logging
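The bullets above combine into a standard training step; the model, data shapes, and learning rate below are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 3), torch.randn(8, 1)

opt.zero_grad()                              # clear stale gradients first
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                              # gradients accumulate into .grad
opt.step()

# detach before logging so the graph is not kept alive
running_loss = loss.detach().item()
```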

Device Management

  • Model AND data must be on same device β€” model.to(device) and tensor.to(device)
  • .cuda() vs .to('cuda') β€” both work, .to(device) more flexible
  • CUDA tensors can't convert to numpy directly β€” .cpu().numpy() required
  • torch.device('cuda' if torch.cuda.is_available() else 'cpu') β€” portable code
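Put together, a portable device-handling sketch (the tensor shape is arbitrary):

```python
import torch

# Portable device selection: CUDA if present, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

t = torch.randn(2, 2).to(device)   # model.to(device) works the same way

# A CUDA tensor cannot convert to numpy directly; hop through .cpu()
arr = t.cpu().numpy()
```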

DataLoader

  • num_workers > 0 uses multiprocessing β€” Windows needs if __name__ == '__main__':
  • pin_memory=True with CUDA β€” faster transfer to GPU
  • Workers don't share state β€” random seeds differ per worker, set in worker_init_fn
  • Large num_workers can cause memory issues β€” start with 2-4, increase if CPU-bound
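A sketch of a DataLoader configured along these lines, using a toy TensorDataset. On Windows (or any spawn-based platform) the loader construction and iteration below would additionally need an `if __name__ == '__main__':` guard:

```python
import random
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # DataLoader seeds torch per worker, but Python's random (and numpy)
    # need explicit per-worker seeding to avoid duplicated augmentations
    random.seed(torch.initial_seed() % 2**32)

ds = TensorDataset(torch.arange(100.0).unsqueeze(1))

loader = DataLoader(
    ds,
    batch_size=10,
    num_workers=2,                          # start small; raise only if CPU-bound
    pin_memory=torch.cuda.is_available(),   # only helps when copying to GPU
    worker_init_fn=worker_init_fn,
)
n = sum(batch[0].shape[0] for batch in loader)
```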

Saving and Loading

  • torch.save(model.state_dict(), path) β€” recommended, saves only weights
  • Loading: create model first, then model.load_state_dict(torch.load(path))
  • map_location for cross-device β€” torch.load(path, map_location='cpu') if saved on GPU
  • Saving whole model pickles code path β€” breaks if code changes
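The recommended round trip looks like this; the architecture and temp path are illustrative:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), 'model.pt')

# Save only the weights, not the pickled module
torch.save(model.state_dict(), path)

# Loading: build the same architecture first, then load weights.
# map_location='cpu' makes this work even if the file was saved on GPU.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location='cpu'))
assert torch.equal(model.weight, restored.weight)
```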

In-place Operations

  • In-place ops end with _ β€” tensor.add_(1) vs tensor.add(1)
  • In-place on leaf variable breaks autograd β€” error about modified leaf
  • In-place on intermediate can corrupt gradient β€” avoid in computation graph
  • tensor.data bypasses autograd β€” legacy, prefer .detach() for safety

Memory Management

  • Accumulating un-detached loss tensors leaks memory — .detach() or .item() metrics before logging
  • torch.cuda.empty_cache() releases cached memory β€” but doesn't fix leaks
  • Delete references and call gc.collect() β€” before empty_cache if needed
  • with torch.no_grad(): prevents graph storage β€” crucial for validation loop
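A leak-free validation loop following these rules; the model and synthetic batches are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
val_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

model.eval()
total = 0.0
with torch.no_grad():                 # no graph is built, so memory stays flat
    for x, y in val_batches:
        loss = nn.functional.mse_loss(model(x), y)
        total += loss.item()          # .item() yields a plain float, holds no graph
```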

Common Mistakes

  • BatchNorm with batch_size=1 fails in train mode β€” use eval mode or track_running_stats=False
  • Loss function reduction default is 'mean' β€” may want 'sum' for gradient accumulation
  • cross_entropy expects logits β€” not softmax output
  • .item() to get a Python scalar from a 0-dim tensor — indexing a scalar loss with [0] is deprecated and now raises an error
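Two of these mistakes, shown concretely (shapes and class counts are arbitrary):

```python
import torch
import torch.nn.functional as F

# cross_entropy applies log_softmax internally: feed it raw logits.
# Passing softmax output silently produces wrong, overly flat losses.
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
loss = F.cross_entropy(logits, targets)               # default reduction='mean'
loss_sum = F.cross_entropy(logits, targets, reduction='sum')
scalar = loss.item()                                  # Python float for logging

# BatchNorm cannot compute batch statistics from a single sample in train mode
bn = torch.nn.BatchNorm1d(5)
bn.train()
bn_failed = False
try:
    bn(torch.randn(1, 5))
except ValueError:
    bn_failed = True      # "Expected more than 1 value per channel..."

bn.eval()                 # eval mode uses running stats, so batch_size=1 is fine
out = bn(torch.randn(1, 5))
```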

Source: https://github.com/benchflow-ai/SkillsBench#registry-terminal_bench_2.0-full_batch_reviewed-terminal_bench_2_0_torch-tensor-parallelism-environment-skills-pytorch

Content curated from original sources, copyright belongs to authors

