
PyTorch DDP Scaling Benchmark

Author: Will Paik
I optimize large-scale GPU clusters for AI/ML workloads. Outside of work, I am building a mini-supercomputer from consumer hardware and documenting every step here.

A reproducible benchmark suite for characterizing PyTorch Distributed Data Parallel (DDP) training performance on NVIDIA GPU clusters. Built for pre-production validation of large-scale HPC infrastructure, and generalized for broader use on any Slurm-based system.

Figure: Benchmark architecture overview

What it does

The benchmark runs ResNet training on synthetic on-device data to isolate GPU compute and NCCL communication from storage I/O. It measures two complementary scaling modes:

Weak scaling keeps per-GPU batch size fixed while adding GPUs. This answers whether each GPU stays productive as the system grows. Throughput per GPU should stay flat; a drop indicates communication overhead.
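As a rough illustration of the weak-scaling check, per-GPU throughput at each GPU count can be compared with the smallest run. This is a minimal sketch, not the benchmark's own code: the function name `weak_scaling_efficiency` is hypothetical and the throughput figures are made up for the example.

```python
def weak_scaling_efficiency(throughput_by_gpus: dict[int, float]) -> dict[int, float]:
    """Per-GPU throughput at each GPU count, relative to the smallest run.

    Keys are GPU counts; values are total throughput (e.g. images/sec
    summed over all GPUs). A value of 1.0 means per-GPU throughput is flat.
    """
    base_n = min(throughput_by_gpus)
    base_per_gpu = throughput_by_gpus[base_n] / base_n
    return {
        n: (total / n) / base_per_gpu
        for n, total in sorted(throughput_by_gpus.items())
    }


# Illustrative numbers only, not measured results.
example = {1: 1000.0, 2: 1960.0, 4: 3800.0, 8: 7200.0}
eff = weak_scaling_efficiency(example)
for n, e in eff.items():
    print(f"{n:>2} GPUs: {e:.1%} of baseline per-GPU throughput")
```

A ratio that falls as GPUs are added is the "drop" described above; the sketch only reports the ratio and does not attribute the cause.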

Strong scaling keeps the global batch size fixed while adding GPUs, so each GPU processes a smaller share of the batch. This answers how much faster the same workload runs with more resources. The result is expressed as speedup and parallel efficiency relative to a single-GPU baseline.
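Speedup and parallel efficiency follow the usual definitions: speedup is the single-GPU time divided by the N-GPU time, and efficiency is speedup divided by N. The sketch below assumes step times are available per GPU count and that the global batch divides evenly across GPUs; the helper names and the timings are hypothetical, not taken from the repository.

```python
def per_gpu_batch(global_batch: int, gpus: int) -> int:
    """Split a fixed global batch evenly across GPUs (assumed even split)."""
    if global_batch % gpus != 0:
        raise ValueError("global batch must be divisible by GPU count")
    return global_batch // gpus


def strong_scaling(step_time_by_gpus: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {gpus: (speedup, efficiency)} relative to the single-GPU run."""
    t1 = step_time_by_gpus[1]
    results = {}
    for n, t in sorted(step_time_by_gpus.items()):
        speedup = t1 / t
        results[n] = (speedup, speedup / n)
    return results


# Illustrative seconds-per-step values only, not measured results.
times = {1: 0.80, 2: 0.40, 4: 0.25, 8: 0.16}
results = strong_scaling(times)
for n, (speedup, efficiency) in results.items():
    print(f"{n} GPUs: batch/GPU={per_gpu_batch(256, n)}, "
          f"speedup={speedup:.2f}x, efficiency={efficiency:.1%}")
```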

Results are collected as JSON per run and aggregated into tables (text) and plots (PNG). GPU activity is sampled during measurement via nvidia-smi so low-utilization configurations are flagged automatically.
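The aggregation and flagging step might look roughly like the following. Everything specific here is an assumption for illustration: the JSON field names (`gpus`, `images_per_sec`, `gpu_util_csv`), the 80% threshold, and the choice to store raw utilization samples in the same form that `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` prints (one percentage per line). The repository's actual schema may differ.

```python
import json
from statistics import mean

LOW_UTIL_THRESHOLD = 80.0  # percent; hypothetical cutoff for this sketch


def parse_util_samples(csv_text: str) -> list[float]:
    """Parse one utilization percentage per non-empty line."""
    return [float(line) for line in csv_text.splitlines() if line.strip()]


def summarize(run_json: str) -> dict:
    """Reduce one per-run JSON document to a table row (assumed schema)."""
    run = json.loads(run_json)
    util = mean(parse_util_samples(run["gpu_util_csv"]))
    return {
        "gpus": run["gpus"],
        "images_per_sec": run["images_per_sec"],
        "mean_util": util,
        "low_util": util < LOW_UTIL_THRESHOLD,
    }


def format_table(rows: list[dict]) -> str:
    """Render rows as a plain-text table, marking low-utilization runs."""
    lines = [f"{'GPUs':>4}  {'img/s':>10}  {'util %':>6}  flag"]
    for r in sorted(rows, key=lambda r: r["gpus"]):
        flag = "LOW" if r["low_util"] else ""
        lines.append(
            f"{r['gpus']:>4}  {r['images_per_sec']:>10.1f}  {r['mean_util']:>6.1f}  {flag}"
        )
    return "\n".join(lines)


# Made-up runs for illustration; not measured results.
runs = [
    json.dumps({"gpus": 4, "images_per_sec": 3800.0, "gpu_util_csv": "97\n95\n98\n96\n"}),
    json.dumps({"gpus": 8, "images_per_sec": 5100.0, "gpu_util_csv": "62\n70\n58\n66\n"}),
]
rows = [summarize(r) for r in runs]
table = format_table(rows)
print(table)
```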

Why it was built

Statewide AI research infrastructure serving hundreds of researchers needs to be validated before opening to users. This benchmark was developed to stress-test GPU compute and inter-node communication on B200 and RTX Pro 6000 Blackwell hardware before cluster launch, and to produce numbers that can be reported to stakeholders and researchers in a reproducible way.

Technical details

  • Language: Python 3.11+, Bash
  • Framework: PyTorch 2.7+ with torch.distributed / NCCL
  • Scheduler: Slurm (torchrun + srun pattern for multi-node)
  • Precision: BF16 by default (FP32 and FP16 supported)
  • Measurement: time-based, with synchronized step counting across ranks and a warmup phase to absorb cuDNN autotuning and optional torch.compile JIT cost
  • Outputs: per-run JSON, terminal tables, matplotlib PNG plots
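The measurement approach in the list above (warmup, then a time-based window in which all ranks count the same number of steps) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the repository's implementation: `measure`, `agree_stop`, and `FakeClock` are hypothetical names. In a real DDP run, `agree_stop` would need to be a collective so that every rank leaves the loop on the same step (for example an all-reduce of a stop flag), and `step_fn` would include a GPU synchronization; here both are stand-ins, and a simulated clock keeps the example instant and deterministic.

```python
import time
from typing import Callable


def measure(
    step_fn: Callable[[], None],
    warmup_steps: int,
    duration_s: float,
    clock: Callable[[], float] = time.perf_counter,
    agree_stop: Callable[[bool], bool] = lambda local_stop: local_stop,
) -> tuple[int, float]:
    """Run untimed warmup steps, then count steps until duration_s elapses.

    agree_stop turns this rank's stop decision into a shared one; the
    default (single process) simply returns the local decision.
    """
    for _ in range(warmup_steps):
        step_fn()  # absorbs one-time costs; excluded from timing

    start = clock()
    steps = 0
    while True:
        local_stop = clock() - start >= duration_s
        if agree_stop(local_stop):
            break
        step_fn()
        steps += 1
    return steps, clock() - start


class FakeClock:
    """Simulated clock so the example needs no sleeping or GPU work."""

    def __init__(self) -> None:
        self.now = 0.0
        self.calls = 0

    def __call__(self) -> float:
        return self.now

    def step(self) -> None:
        # Pretend each training step takes exactly 0.25 s.
        self.now += 0.25
        self.calls += 1


fake = FakeClock()
steps, elapsed = measure(fake.step, warmup_steps=4, duration_s=1.0, clock=fake)
batch_size = 64  # hypothetical per-GPU batch size
print(f"{steps} steps in {elapsed:.2f}s -> {steps * batch_size / elapsed:.1f} samples/s")
```

Counting steps inside a fixed time window, rather than timing a fixed number of steps, keeps run length predictable across very different GPU counts; the cost is that ranks must agree on when to stop.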

Repository

github.com/willgpaik/pytorch-ddp-scaling-benchmark