home/categories/framework-internals/nvidia-tensorrt-llm-claude-skills-perf-workload-profiling-skill-md
framework-internalsdevelopment

perf-workload-profiling

Code instrumentation for timing workloads. Two scenarios: (1) Training loop — inject manual timing to report per-iteration latency, throughput (samples/sec), and data load time. (2) Standalone kernel/op — write CUDA event timing code with warmup, per-iteration statistics, and anti-pattern avoidance. Also covers NVTX annotation for labeling profiler timelines. NOT for: running or analyzing profiler tools (nsys, ncu, Nsight Systems, Nsight Compute), writing kernels (Triton, CuTe, CUDA), applying optimizations (CUDA Graphs, gradient checkpointing, fusion), or interpreting roofline/SOL% metrics. Triggers: "measure throughput", "benchmark this function", "time my training loop", "samples per second", "NVTX annotate", "instrument my dataloader", "data load time", "kernel timing", "how do I time".

NVIDIA
maintainer
NVIDIA
Updated 4/8/2026
Stars
13335
Forks
2271
quick start

Installation and usage

Code instrumentation for timing workloads. Two scenarios: (1) Training loop — inject manual timing to report per-iteration latency, throughput (samples/sec), and data load time. (2) Standalone kernel/op — write CUDA event timing code with warmup, per-iteration statistics, and anti-pattern avoidance. Also covers NVTX annotation for labeling profiler timelines. NOT for: running or analyzing profiler tools (nsys, ncu, Nsight Systems, Nsight Compute), writing kernels (Triton, CuTe, CUDA), applying optimizations (CUDA Graphs, gradient checkpointing, fusion), or interpreting roofline/SOL% metrics. Triggers: "measure throughput", "benchmark this function", "time my training loop", "samples per second", "NVTX annotate", "instrument my dataloader", "data load time", "kernel timing", "how do I time".

Installation
$ install --globalskills.sh
Usage

Once installed, you can use this skill by running the following command in your terminal:

skills use perf-workload-profiling