inference-serverby huggingface

Start and test the prime-rl inference server. Use when asked to run inference, start vLLM, test a model, or launch the inference server.

npx skills add https://github.com/huggingface/prime-rl --skill inference-server

Inference Server

Starting the server

Always use the inference entry point — never vllm serve or python -m vllm.entrypoints.openai.api_server directly. The entry point runs setup_vllm_env() which configures environment variables (LoRA, multiprocessing) before vLLM is imported.

# With a TOML config
uv run inference @ path/to/config.toml

# With CLI overrides
uv run inference --model.name Qwen/Qwen3-0.6B --model.max_model_len 2048 --model.enforce_eager

# Combined
uv run inference @ path/to/config.toml --server.port 8001 --gpu-memory-utilization 0.5

SLURM scheduling

The inference entrypoint supports optional SLURM scheduling, following the same patterns as SFT and RL.

Single-node SLURM

# inference_slurm.toml
output_dir = "/shared/outputs/my-inference"

[model]
name = "Qwen/Qwen3-8B"

[parallel]
tp = 8

[slurm]
job_name = "my-inference"
partition = "cluster"
uv run inference @ inference_slurm.toml

Multi-node SLURM (independent vLLM replicas)

Each node runs an independent vLLM instance. No cross-node parallelism — TP and DP must fit within a single node's GPUs.

# inference_multinode.toml
output_dir = "/shared/outputs/my-inference"

[model]
name = "PrimeIntellect/INTELLECT-3-RL-600"

[parallel]
tp = 8
dp = 1

[deployment]
type = "multi_node"
num_nodes = 4
gpus_per_node = 8

[slurm]
job_name = "my-inference"
partition = "cluster"

Dry run

Add dry_run = true to generate the sbatch script without submitting:

uv run inference @ config.toml --dry-run true

Custom endpoints

The server extends vLLM with:

  • /v1/chat/completions/tokens — accepts token IDs as prompt input (used by multi-turn RL rollouts)
  • /update_weights — hot-reload model weights from the trainer
  • /load_lora_adapter — load LoRA adapters at runtime
  • /init_broadcaster — initialize weight broadcast for distributed training

Testing the server

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 50
  }'

Key files

  • src/prime_rl/entrypoints/inference.py — entrypoint with local/SLURM routing
  • src/prime_rl/inference/server.py — vLLM env setup
  • src/prime_rl/configs/inference.pyInferenceConfig and all sub-configs
  • src/prime_rl/inference/vllm/server.py — FastAPI routes and vLLM monkey-patches
  • src/prime_rl/templates/inference.sbatch.j2 — SLURM template (handles both single and multi-node)
  • configs/debug/infer.toml — minimal debug config

More skills from huggingface

Hugging Face Cli
by huggingface
Execute Hugging Face Hub operations using the `hf` CLI. Use when the user needs to download models/datasets/spaces, upload files to Hub repositories, create repos, manage local cache, or run compute jobs on HF infrastructure. Covers authentication, file transfers, repository creation, cache operations, and cloud compute.
Hugging Face Datasets
by huggingface
Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.
Hugging Face Evaluation
by huggingface
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
Hugging Face Jobs
by huggingface
Run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks.
Hugging Face Model Trainer
by huggingface
Train or fine-tune language models using TRL (Transformer Reinforcement Learning) on Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on dataset preparation, hardware selection, cost estimation, and model persistence.
Hugging Face Paper Publisher
by huggingface
Publish and manage research papers on Hugging Face Hub. Supports creating paper pages, linking papers to models/datasets, claiming authorship, and generating professional markdown-based research articles.
Hugging Face Tool Builder
by huggingface
Build reusable scripts and tools using the Hugging Face API. Useful when chaining or combining API calls, or when tasks will be repeated/automated. Creates reusable command line scripts to fetch, enrich, or process data from Hugging Face Hub.
Hugging Face Trackio
by huggingface
Track and visualize ML training experiments with Trackio. Use when logging metrics during training (Python API) or retrieving/analyzing logged metrics (CLI). Supports real-time dashboard visualization, HF Space syncing, and JSON output for automation.