cuda-kernels

作者: huggingface

提供針對NVIDIA GPU(H100、A100、T4)撰寫和基準測試優化CUDA核心的指導,目標為HuggingFace diffusers和transformers…

npx skills add https://github.com/huggingface/kernels --skill cuda-kernels

CUDA Kernels for Diffusers & Transformers

This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with HuggingFace diffusers and transformers libraries.

Quick Start

Diffusers (Video/Image Generation)

For benchmarking kernel performance:

# Benchmark with optimized kernels (6% end-to-end speedup)
python generate_video.py --use-optimized-kernels

# Benchmark baseline with torch.compile (34% speedup)
python generate_video.py --no-optimized-kernels --compile

# Compare configurations (note: --compile and --use-optimized-kernels are mutually exclusive)
python generate_video.py --use-optimized-kernels && \
python generate_video.py --no-optimized-kernels --compile

For a minimal diffusers integration example (~150 lines):

python scripts/ltx_kernel_injection_example.py

Transformers (LLMs)

For a minimal transformers integration example (~120 lines):

python scripts/transformers_injection_example.py

HuggingFace Kernels Hub

Load pre-compiled kernels from HuggingFace Hub (no local compilation):

from kernels import get_kernel

# Load optimized activation kernels
activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)

For a complete HuggingFace Kernels example:

python scripts/huggingface_kernels_example.py

Isolated Kernel Micro-benchmarks

python benchmark_rmsnorm.py

Supported Libraries & Models

LibrarySupported ModelsKey Kernels
diffusersLTX-Video, Stable Diffusion, FLUX, DiTRMSNorm, GEGLU, RoPE, AdaLN
transformersLLaMA, Mistral, Qwen, FalconRMSNorm, Attention
GPUCompute CapabilityGuide
H100sm_90h100-optimization-guide.md
A100sm_80a100-optimization-guide.md
T4sm_75t4-optimization-guide.md

When This Skill Applies

Use this skill when:

  • Benchmarking kernel performance against baseline implementations
  • Writing new CUDA kernels for diffusion models or LLMs
  • Optimizing existing kernels for H100, A100, or T4 architecture
  • Implementing custom attention, normalization, or activation layers
  • Integrating kernels with diffusers pipelines (LTX-Video, Stable Diffusion, FLUX, DiT)
  • Integrating kernels with transformers models (LLaMA, Mistral, Qwen)
  • Debugging kernel performance issues on NVIDIA GPUs

Working Example

A complete working example is available at examples/ltx_video/. This demonstrates:

  • Custom CUDA kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN)
  • Build system setup with setup.py, build.toml, and flake.nix
  • PyTorch C++ bindings and Python API
  • Benchmarking script for comparing optimized vs baseline performance

Benchmarking Kernels

Use the benchmark script to measure kernel performance:

# Full benchmark with all options
python scripts/benchmark_example.py \
    --use-optimized-kernels \
    --compile \
    --batch-size 1 \
    --num-frames 161 \
    --height 512 \
    --width 768 \
    --steps 50 \
    --warmup-iterations 2

Benchmark Script Options

OptionDefaultDescription
--use-optimized-kernelsautoUse custom H100 CUDA kernels
--no-optimized-kernels-Use baseline implementation
--compilefalseEnable torch.compile on transformer
--batch-size1Number of videos per prompt
--num-frames161Number of frames to generate
--height512Video height in pixels
--width768Video width in pixels
--steps50Denoising steps
--warmup-iterations2Warmup runs before benchmark

Example Benchmark Results

End-to-End Video Generation (49 frames, 30 steps, H100 80GB):

ConfigurationTime (s)it/sSpeedupNotes
Baseline (no compile)2.8712.581.00xReference
Optimized Kernels2.7013.521.06x6% faster
Baseline + torch.compile2.1419.051.34x34% faster

Important: --use-optimized-kernels and --compile are currently mutually exclusive. Custom kernels require PyTorch custom op registration to work with torch.compile.

Key metrics to capture:

  • Device: GPU model (e.g., NVIDIA H100 80GB HBM3)
  • Precision: Data type used (e.g., bfloat16)
  • Resolution: Width x Height (e.g., 768x512)
  • Frames: Number of frames generated (e.g., 49, 161)

RMSNorm Micro-benchmarks

The vectorized RMSNorm kernel achieves 2.67x average speedup over PyTorch baseline:

ShapeCustom (ms)PyTorch (ms)Speedup
[1×1024×2048]0.0190.0653.37x
[2×1024×2048]0.0240.0733.04x
[4×1024×2048]0.0360.0932.58x
[2×4096×3072]0.0870.2082.41x
[4×4096×3072]0.1570.3922.49x

Bandwidth efficiency: 38% of H100's theoretical 3.35 TB/s

Why end-to-end speedup is smaller: RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention (Flash Attention/SDPA), linear projections, and VAE decode.

Project Structure

.claude/skills/cuda-kernels/
├── scripts/
│   ├── benchmark_example.py              # End-to-end video generation benchmark
│   ├── benchmark_rmsnorm.py              # Isolated RMSNorm micro-benchmark
│   ├── ltx_kernel_injection_example.py   # Minimal diffusers integration (~150 lines)
│   ├── transformers_injection_example.py # Minimal transformers integration (~120 lines)
│   └── huggingface_kernels_example.py    # HuggingFace Kernels Hub integration
├── references/
│   ├── diffusers-integration.md          # Complete diffusers integration guide
│   ├── transformers-integration.md       # Complete transformers integration guide
│   ├── huggingface-kernels-integration.md # HuggingFace Kernels Hub (get_kernel) guide
│   ├── troubleshooting.md                # Common issues and solutions
│   ├── kernel-templates.md               # CUDA kernel templates (includes vectorized)
│   ├── h100-optimization-guide.md        # H100 (Hopper) optimization deep dive
│   ├── a100-optimization-guide.md        # A100 (Ampere) optimization deep dive
│   └── t4-optimization-guide.md          # T4 (Turing) optimization deep dive
└── SKILL.md                              # This file

examples/ltx_video/                  # Complete working example
├── kernel_src/
│   └── rmsnorm.cu                  # Vectorized RMSNorm kernel (2.67x faster)
├── torch-ext/                      # PyTorch bindings
├── generate_video.py               # Full benchmark script
├── benchmark_rmsnorm.py            # Isolated kernel benchmark
└── setup.py                        # pip install -e .

GPU Architecture Reference

H100 (Hopper) - Primary Target

SpecValueOptimization Impact
SMs132Grid sizing: aim for multiples of 132
Threads/SM2048Max 16 blocks of 128 threads per SM
Shared Memory192 KB/SMLarge tiles possible
L2 Cache50 MBReuse across blocks
Memory BW3.35 TB/sCoalesced access critical
Warp Size32All reductions use warp shuffles

Quick Comparison (H100 vs A100 vs T4)

SpecH100A100T4
SMs13210840
Memory BW3.35 TB/s2.0 TB/s320 GB/s
Shared Mem/SM192 KB164 KB64 KB
BF16 SupportYesYesNo (FP16 only)
Compute Capsm_90sm_80sm_75

See detailed guides: H100 | A100 | T4

Core Kernel Patterns

Vectorized Memory Access (Critical for Performance)

BFloat16 vectorization using __nv_bfloat162:

// Load 2 bfloat16 elements at once (32-bit load)
const __nv_bfloat162* vec_input = reinterpret_cast<const __nv_bfloat162*>(row_input);

#pragma unroll 4
for (int i = tid; i < vec_hidden; i += stride) {
    __nv_bfloat162 v = vec_input[i];
    float v0 = __bfloat162float(v.x);
    float v1 = __bfloat162float(v.y);
    sum_sq += v0 * v0 + v1 * v1;
}

FP16 vectorization using __half2:

const __half2* vec_input = reinterpret_cast<const __half2*>(row_input);
__half2 v = vec_input[i];
float v0 = __half2float(v.x);
float v1 = __half2float(v.y);

FP32 vectorization using float4:

const float4* vec_input = reinterpret_cast<const float4*>(row_input);
float4 v = vec_input[i];
sum_sq += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;

Warp Shuffle Reductions

template <typename T>
__device__ __forceinline__ T warp_reduce_sum(T val) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset);
    }
    return val;
}

Block Sizes for Attention

  • BLOCK_SIZE_M = 128, BLOCK_SIZE_N = 64, BLOCK_SIZE_K = 64
  • NUM_WARPS = 8

Thread Configuration

For element-wise ops (RoPE, GEGLU):

constexpr int BLOCK_SIZE = 256;
int num_blocks = (total_elements + BLOCK_SIZE - 1) / BLOCK_SIZE;

For reduction ops (LayerNorm, RMSNorm) with vectorization:

// Divide by 2 for bf16/fp16 vectorized access
int threads = min(hidden_size / 2, MAX_THREADS);
threads = max(threads, WARP_SIZE);
threads = (threads + 32 - 1) / 32 * 32;  // Round to warp boundary

Supported Data Types

All kernels support three precision modes:

  • __half (FP16) - Default for inference
  • __nv_bfloat16 (BF16) - Preferred for training
  • float (FP32) - Reference/debugging

Building Kernels

With Nix (Recommended)

nix run .#build-and-copy --max-jobs 2 --cores 8 -L

With pip/uv

uv pip install -e .

build.toml Configuration

[general]
name = "ltx_kernels"
backends = ["cuda"]

[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
cuda-capabilities = ["9.0"]

Library Integration

HuggingFace Kernels Hub (get_kernel)

See huggingface-kernels-integration.md for the complete guide.

Load pre-compiled, optimized kernels directly from HuggingFace Hub without local compilation:

from kernels import get_kernel, has_kernel

# Check availability and load
if has_kernel("kernels-community/activation"):
    activation = get_kernel("kernels-community/activation", version=1)

    # Use the kernel
    x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)
    activation.gelu_fast(y, x)

Key functions:

  • get_kernel(repo_id, version=None) - Download and load kernel from Hub
  • has_kernel(repo_id) - Check if compatible build exists
  • get_local_kernel(path) - Load from local directory (development)

Popular community kernels:

  • kernels-community/activation - GELU, SiLU, etc.
  • kernels-community/flash-attn - Flash Attention 2
  • kernels-community/triton-layer-norm - LayerNorm, RMSNorm

Diffusers Integration (Video/Image Generation)

See diffusers-integration.md for the complete guide.

Transformers Integration (LLMs)

See transformers-integration.md for the complete guide.

Key differences from diffusers:

  • Transformers RMSNorm always has weights (no elementwise_affine=False)
  • Use 'RMSNorm' in class_name to match LlamaRMSNorm, MistralRMSNorm, etc.
  • Check for variance_epsilon (LLaMA) or eps (others) for epsilon
  • No set_processor() pattern - use Flash Attention 2 instead

Minimal transformers pattern:

from transformers import AutoModelForCausalLM
from ltx_kernels import rmsnorm

def patch_rmsnorm(model):
    for name, module in model.named_modules():
        if 'RMSNorm' in type(module).__name__:
            eps = getattr(module, 'variance_epsilon', None) or getattr(module, 'eps', 1e-6)
            def make_forward(mod, epsilon):
                def forward(x):
                    return rmsnorm(x, mod.weight, eps=epsilon)
                return forward
            module.forward = make_forward(module, eps)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
patch_rmsnorm(model)

Diffusers Critical Pitfalls

1. RMSNorm Weight May Be None

LTX-Video uses elementwise_affine=False for some RMSNorm modules:

# Transformer blocks: NO WEIGHT
self.norm1 = RMSNorm(dim, elementwise_affine=False)

# Attention modules: HAS WEIGHT
self.norm_q = torch.nn.RMSNorm(..., elementwise_affine=True)

Solution: Handle both cases:

has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
    output = rmsnorm(x, module.weight, eps=eps)
else:
    weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    output = rmsnorm(x, weight, eps=eps)

2. Diffusers RMSNorm != torch.nn.RMSNorm

# WRONG - misses diffusers RMSNorm
if isinstance(module, torch.nn.RMSNorm):

# CORRECT - catches all RMSNorm variants
if type(module).__name__ == 'RMSNorm':

3. LTX-Video Uses GELU, Not GEGLU

LTX-Video uses activation_fn="gelu-approximate". Don't patch GEGLU for LTX-Video.

4. Inject Kernels BEFORE CPU Offloading

pipe = LTXPipeline.from_pretrained(...)
pipe.to("cuda")
inject_optimized_kernels(pipe)  # BEFORE offloading
pipe.enable_model_cpu_offload()  # Now safe

Minimal Integration Pattern

from diffusers import LTXPipeline
from ltx_kernels import rmsnorm

def patch_rmsnorm_modules(model):
    """Patch all RMSNorm modules to use custom kernel."""
    for name, module in model.named_modules():
        if type(module).__name__ == 'RMSNorm':
            eps = getattr(module, 'eps', 1e-6)
            has_weight = hasattr(module, 'weight') and module.weight is not None

            if has_weight:
                def make_forward(mod, epsilon):
                    def forward(x):
                        return rmsnorm(x, mod.weight, eps=epsilon)
                    return forward
                module.forward = make_forward(module, eps)
            else:
                def make_forward(epsilon):
                    def forward(x):
                        w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
                        return rmsnorm(x, w, eps=epsilon)
                    return forward
                module.forward = make_forward(eps)

# Usage
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()

Kernel-Specific Guidelines

RMSNorm

  • Input layout: [..., hidden_size]
  • Epsilon default: 1e-6
  • Weight may be None if elementwise_affine=False
  • Vectorization: Use __nv_bfloat162 for BF16, __half2 for FP16, float4 for FP32
  • Performance: 2.67x faster than PyTorch with vectorized implementation
  • Bandwidth: Achieves ~38% of H100's 3.35 TB/s theoretical bandwidth

RoPE

  • 1D: [batch, seq, heads, head_dim] - for text
  • 3D: [batch, t*h*w, heads, head_dim] - for video
  • LTX-Video computes its own RoPE via LTXVideoRotaryPosEmbed

GEGLU vs GELU

  • GEGLU: Input [batch, seq, 2*hidden] -> Output [batch, seq, hidden]
  • GELU: Standard activation
  • LTX-Video uses GELU, NOT GEGLU

AdaLN

  • Formula: norm(x) * weight * (1 + scale) + shift
  • Used in DiT blocks for conditioning

Performance Profiling

# NVIDIA Nsight Systems
nsys profile -o profile python your_script.py

# NVIDIA Nsight Compute
ncu --set full -o metrics python your_script.py

Common Issues

See troubleshooting.md for all common issues and solutions.

Quick fixes:

  • "NoneType has no attribute contiguous": RMSNorm weight is None, create ones
  • isinstance() not matching: Use type(module).__name__ instead
  • GEGLU not called: Model uses GELU, not GEGLU
  • Patching doesn't persist: Inject before enable_model_cpu_offload()
  • torch.compile fails with custom kernels: See below

torch.compile Compatibility

Custom CUDA kernels and torch.compile are mutually exclusive unless you register the kernel as a PyTorch custom op.

Error message:

torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped

Workaround options:

  1. Use --use-optimized-kernels without --compile (6% speedup)
  2. Use --compile without custom kernels (34% speedup)
  3. Register kernel as custom op (advanced, requires torch.library)

To register as custom op (for torch.compile compatibility):

import torch

@torch.library.custom_op("ltx_kernels::rmsnorm", mutates_args={"out"})
def rmsnorm(out: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, eps: float) -> None:
    ops.rmsnorm_forward(out, input.contiguous(), weight.contiguous(), eps)

@rmsnorm.register_fake
def _(out, input, weight, eps):
    pass  # No shape changes

See Also

Scripts

Integration Guides

GPU Optimization Guides

Reference

External Resources

來自 huggingface 的更多技能

Hugging Face Cli
huggingface
Execute Hugging Face Hub operations using the `hf` CLI. Use when the user needs to download models/datasets/spaces, upload files to Hub repositories, create repos, manage local cache, or run compute jobs on HF infrastructure. Covers authentication, file transfers, repository creation, cache operations, and cloud compute.
official
Hugging Face Datasets
huggingface
在 Hugging Face Hub 上建立與管理資料集。支援初始化儲存庫、定義配置/系統提示、串流更新資料列,以及基於 SQL 的資料集查詢/轉換。設計與 HF MCP 伺服器搭配使用,以實現完整的資料集工作流程。
official
Hugging Face Evaluation
huggingface
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
official
Hugging Face Jobs
huggingface
在 Hugging Face Jobs 基礎架構上執行任何工作負載。涵蓋 UV 腳本、基於 Docker 的工作、硬體選擇、成本估算、使用令牌進行身份驗證、機密管理、超時配置以及結果持久化。專為通用計算工作負載設計,包括數據處理、推理、實驗、批次作業以及任何基於 Python 的任務。
official
Hugging Face Model Trainer
huggingface
Train or fine-tune language models using TRL (Transformer Reinforcement Learning) on Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on dataset preparation, hardware selection, cost estimation, and model persistence.
official
Hugging Face Paper Publisher
huggingface
在 Hugging Face Hub 上發佈與管理研究論文。支援建立論文頁面、將論文連結至模型/資料集、聲明作者身份,以及生成基於 Markdown 的專業研究文章。
official
Hugging Face Tool Builder
huggingface
使用 Hugging Face API 建立可重複使用的腳本與工具。適用於串接或組合 API 呼叫,或需要重複執行/自動化的任務。建立可重複使用的命令列腳本,從 Hugging Face Hub 擷取、豐富或處理資料。
official
Hugging Face Trackio
huggingface
Track and visualize ML training experiments with Trackio. Use when logging metrics during training (Python API) or retrieving/analyzing logged metrics (CLI). Supports real-time dashboard visualization, HF Space syncing, and JSON output for automation.
official