perf-workload-profiling

Name: perf-workload-profiling
Author: NVIDIA

Code instrumentation for timing workloads. Two scenarios: (1) Training loop — inject manual timing to report per-iteration latency, throughput (samples/sec), and data load time. (2) Standalone kernel/op — write CUDA event timing code with warmup, per-iteration statistics, and anti-pattern avoidance. Also covers NVTX annotation for labeling profiler timelines. NOT for: running or analyzing profiler tools (nsys, ncu, Nsight Systems, Nsight Compute), writing kernels (Triton, CuTe, CUDA), applying optimizations (CUDA Graphs, gradient checkpointing, fusion), or interpreting roofline/SOL% metrics. Triggers: "measure throughput", "benchmark this function", "time my training loop", "samples per second", "NVTX annotate", "instrument my dataloader", "data load time", "kernel timing", "how do I time".

View Source framework-internals

maintainer

NVIDIA

Updated 4/8/2026

Stars

13335

Forks

2271

quick start

Installation and usage

Installation

$ install --globalskills.sh

Usage

Once installed, you can use this skill by running the following command in your terminal:

skills use perf-workload-profiling