gptq

bởi firecrawl

Lượng tử hóa 4-bit sau huấn luyện cho các LLM với độ suy giảm độ chính xác tối thiểu. Sử dụng để triển khai các mô hình lớn (70B, 405B) trên GPU tiêu dùng, khi bạn cần giảm 4× bộ nhớ…

npx skills add https://github.com/firecrawl/ai-research-skills --skill gptq

Tải ZIP GitHub

GPTQ (Generative Pre-trained Transformer Quantization)

Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.

When to use GPTQ

Use GPTQ when:

Need to fit large models (70B+) on limited GPU memory
Want 4× memory reduction with <2% accuracy loss
Deploying on consumer GPUs (RTX 4090, 3090)
Need faster inference (3-4× speedup vs FP16)

Use AWQ instead when:

Need slightly better accuracy (<1% loss)
Have newer GPUs (Ampere, Ada)
Want Marlin kernel support (2× faster on some GPUs)

Use bitsandbytes instead when:

Need simple integration with transformers
Want 8-bit quantization (less compression, better quality)
Don't need pre-quantized model files

Quick start

Installation

# Install AutoGPTQ
pip install auto-gptq

# With Triton (Linux only, faster)
pip install auto-gptq[triton]

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate

Load pre-quantized model

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False  # Set True on Linux for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

Quantize your own model

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

# Load model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Group size (recommended: 128)
    desc_act=False,      # Activation order (False for CUDA kernel)
    damp_percent=0.01    # Dampening factor
)

# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config
)

# Prepare calibration data
dataset = load_dataset("c4", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"])["input_ids"][:512]
    for example in dataset.take(128)
]

# Quantize
model.quantize(calibration_data)

# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")

# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")

Group-wise quantization

How GPTQ works:

Group weights: Divide each weight matrix into groups (typically 128 elements)
Quantize per-group: Each group has its own scale/zero-point
Minimize error: Uses Hessian information to minimize quantization error
Result: 4-bit weights with near-FP16 accuracy

Group size trade-off:

Group Size	Model Size	Accuracy	Speed	Recommendation
-1 (per-column)	Smallest	Best	Slowest	Research only
32	Smaller	Better	Slower	High accuracy needed
128	Medium	Good	Fast	Recommended default
256	Larger	Lower	Faster	Speed critical
1024	Largest	Lowest	Fastest	Not recommended

Example:

Weight matrix: [1024, 4096] = 4.2M elements

Group size = 128:
- Groups: 4.2M / 128 = 32,768 groups
- Each group: own 4-bit scale + zero-point
- Result: Better granularity → better accuracy

Quantization configurations

Standard 4-bit (recommended)

from auto_gptq import BaseQuantizeConfig

config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Standard group size
    desc_act=False,      # Faster CUDA kernel
    damp_percent=0.01    # Dampening factor
)

Performance:

Memory: 4× reduction (70B model: 140GB → 35GB)
Accuracy: ~1.5% perplexity increase
Speed: 3-4× faster than FP16

High accuracy (3-bit with larger groups)

config = BaseQuantizeConfig(
    bits=3,              # 3-bit (more compression)
    group_size=128,      # Keep standard group size
    desc_act=True,       # Better accuracy (slower)
    damp_percent=0.01
)

Trade-off:

Memory: 5× reduction
Accuracy: ~3% perplexity increase
Speed: 5× faster (but less accurate)

Maximum accuracy (4-bit with small groups)

config = BaseQuantizeConfig(
    bits=4,
    group_size=32,       # Smaller groups (better accuracy)
    desc_act=True,       # Activation reordering
    damp_percent=0.005   # Lower dampening
)

Trade-off:

Memory: 3.5× reduction (slightly larger)
Accuracy: ~0.8% perplexity increase (best)
Speed: 2-3× faster (kernel overhead)

Kernel backends

ExLlamaV2 (default, fastest)

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,      # Use ExLlamaV2
    exllama_config={"version": 2}
)

Performance: 1.5-2× faster than Triton

Marlin (Ampere+ GPUs)

# Quantize with Marlin format
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False  # Required for Marlin
)

model.quantize(calibration_data, use_marlin=True)

# Load with Marlin
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True  # 2× faster on A100/H100
)

Requirements:

NVIDIA Ampere or newer (A100, H100, RTX 40xx)
Compute capability ≥ 8.0

Triton (Linux only)

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=True  # Linux only
)

Performance: 1.2-1.5× faster than CUDA backend

Integration with transformers

Direct transformers usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model (transformers auto-detects GPTQ)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False
)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)

QLoRA fine-tuning (GPTQ + LoRA)

from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)

# Fine-tune (memory efficient!)
# 70B model trainable on single A100 80GB

Performance benchmarks

Memory reduction

Model	FP16	GPTQ 4-bit	Reduction
Llama 2-7B	14 GB	3.5 GB	4×
Llama 2-13B	26 GB	6.5 GB	4×
Llama 2-70B	140 GB	35 GB	4×
Llama 3-405B	810 GB	203 GB	4×

Enables:

70B on single A100 80GB (vs 2× A100 needed for FP16)
405B on 3× A100 80GB (vs 11× A100 needed for FP16)
13B on RTX 4090 24GB (vs OOM with FP16)

Inference speed (Llama 2-7B, A100)

Precision	Tokens/sec	vs FP16
FP16	25 tok/s	1×
GPTQ 4-bit (CUDA)	85 tok/s	3.4×
GPTQ 4-bit (ExLlama)	105 tok/s	4.2×
GPTQ 4-bit (Marlin)	120 tok/s	4.8×

Accuracy (perplexity on WikiText-2)

Model	FP16	GPTQ 4-bit (g=128)	Degradation
Llama 2-7B	5.47	5.55	+1.5%
Llama 2-13B	4.88	4.95	+1.4%
Llama 2-70B	3.32	3.38	+1.8%

Excellent quality preservation - less than 2% degradation!

Common patterns

Multi-GPU deployment

# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device_map="auto",  # Automatically split across GPUs
    max_memory={0: "40GB", 1: "40GB"}  # Limit per GPU
)

# Manual device mapping
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0-39": 0,  # First 40 layers on GPU 0
    "model.layers.40-79": 1,  # Last 40 layers on GPU 1
    "model.norm": 1,
    "lm_head": 1
}

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map=device_map
)

CPU offloading

# Offload some layers to CPU (for very large models)
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-405B-GPTQ",
    device_map="auto",
    max_memory={
        0: "80GB",  # GPU 0
        1: "80GB",  # GPU 1
        2: "80GB",  # GPU 2
        "cpu": "200GB"  # Offload overflow to CPU
    }
)

Batch inference

# Process multiple prompts efficiently
prompts = [
    "Explain AI",
    "Explain ML",
    "Explain DL"
]

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)

for i, output in enumerate(outputs):
    print(f"Prompt {i}: {tokenizer.decode(output)}")

Finding pre-quantized models

TheBloke on HuggingFace:

https://huggingface.co/TheBloke
1000+ models in GPTQ format
Multiple group sizes (32, 128)
Both CUDA and Marlin formats

Search:

# Find GPTQ models on HuggingFace
https://huggingface.co/models?library=gptq

Download:

from auto_gptq import AutoGPTQForCausalLM

# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device="cuda:0"
)

Supported models

LLaMA family: Llama 2, Llama 3, Code Llama
Mistral: Mistral 7B, Mixtral 8x7B, 8x22B
Qwen: Qwen, Qwen2, QwQ
DeepSeek: V2, V3
Phi: Phi-2, Phi-3
Yi, Falcon, BLOOM, OPT
100+ models on HuggingFace

References

Calibration Guide - Dataset selection, quantization process, quality optimization
Integration Guide - Transformers, PEFT, vLLM, TensorRT-LLM
Troubleshooting - Common issues, performance optimization