nemo-curator

作者: firecrawl

基于GPU加速的大语言模型训练数据整理,支持文本/图像/视频/音频,具备模糊去重(速度提升16倍)、质量过滤(30+启发式规则)等功能。

npx skills add https://github.com/firecrawl/ai-research-skills --skill nemo-curator

NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

When to use NeMo Curator

Use NeMo Curator when:

  • Preparing LLM training data from web scrapes (Common Crawl)
  • Need fast deduplication (16× faster than CPU)
  • Curating multi-modal datasets (text, images, video, audio)
  • Filtering low-quality or toxic content
  • Scaling data processing across GPU cluster

Performance:

  • 16× faster fuzzy deduplication (8TB RedPajama v2)
  • 40% lower TCO vs CPU alternatives
  • Near-linear scaling across GPU nodes

Use alternatives instead:

  • datatrove: CPU-based, open-source data processing
  • dolma: Allen AI's data toolkit
  • Ray Data: General ML data processing (no curation focus)

Quick start

Installation

# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"

Basic text curation pipeline

from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter short docs

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
from nemo_curator.modules import ExactDuplicates
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")

Data curation pipeline

Stage 1: Quality filtering

from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter
)

# Apply 30+ heuristic filters
from nemo_curator import ScoreFilter

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))

Stage 2: Deduplication

Exact deduplication:

from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)

Fuzzy deduplication (16× faster on GPU):

from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,      # MinHash parameters
    num_buckets=20,
    hash_method="md5"
)

deduped = fuzzy_dedup(dataset)

Semantic deduplication:

from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8  # Cosine similarity threshold
)

deduped = semantic_dedup(dataset)

Stage 3: PII redaction

from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace"  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)

Stage 4: Classifier filtering

from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda"
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)

GPU acceleration

GPU vs CPU performance

OperationCPU (16 cores)GPU (A100)Speedup
Fuzzy dedup (8TB)120 hours7.5 hours16×
Exact dedup (1TB)8 hours0.5 hours16×
Quality filtering2 hours0.2 hours10×

Multi-GPU scaling

from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)

Multi-modal curation

Image curation

from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)

Video curation

from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)

Audio curation

from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)

Common patterns

Web scrape curation (Common Crawl)

from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),

    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    PIIRedactor(),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8)
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")

Distributed processing

from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()

Performance benchmarks

Fuzzy deduplication (8TB RedPajama v2)

  • CPU (256 cores): 120 hours
  • GPU (8× A100): 7.5 hours
  • Speedup: 16×

Exact deduplication (1TB)

  • CPU (64 cores): 8 hours
  • GPU (4× A100): 0.5 hours
  • Speedup: 16×

Quality filtering (100GB)

  • CPU (32 cores): 2 hours
  • GPU (2× A100): 0.2 hours
  • Speedup: 10×

Cost comparison

CPU-based curation (AWS c5.18xlarge × 10):

  • Cost: $3.60/hour × 10 = $36/hour
  • Time for 8TB: 120 hours
  • Total: $4,320

GPU-based curation (AWS p4d.24xlarge × 2):

  • Cost: $32.77/hour × 2 = $65.54/hour
  • Time for 8TB: 7.5 hours
  • Total: $491.55

Savings: 89% reduction ($3,828 saved)

Supported data formats

  • Input: Parquet, JSONL, CSV
  • Output: Parquet (recommended), JSONL
  • WebDataset: TAR archives for multi-modal

Use cases

Production deployments:

  • NVIDIA used NeMo Curator to prepare Nemotron-4 training data
  • Open-source datasets curated: RedPajama v2, The Pile

References

Resources

来自 firecrawl 的更多技能

oracle
firecrawl
使用oracle CLI的最佳实践(提示词与文件打包、引擎、会话及文件附件模式)。
official
firecrawl-monitor
firecrawl
检测网站内容变化,并通过webhook或邮件接收通知——无需cron任务、爬虫或差异脚本。当用户想要追踪页面变化、监控竞争对手定价、在新职位或博客发布时接收提醒、监测文档/更新日志/状态页面,或说出“监控”、“关注”、“追踪”、“当……时提醒我”、“当X变化时通知我”、“如果……请通知我”、“当……时发邮件给我”或“当……时发送webhook”时,使用此技能。内置AI判断器会过滤掉格式、时间戳和……
officialweb-scrapingresearch
firecrawl-deep-research
firecrawl
使用 Firecrawl 进行多源深度研究。当用户要求研究某个主题、比较不同观点、生成带来源的简报、调查技术或市场问题,或综合多个来源的网络证据时使用。
officialresearchweb-scraping
firecrawl-research-papers
firecrawl
使用Firecrawl查找并综合研究论文、白皮书、PDF文件、技术报告及学术来源。适用于用户需要文献综述、论文摘要、研究现状分析,或从PDF及学术/行业出版物中获取有来源的综合内容时。
officialresearchweb-scraping
firecrawl-market-research
firecrawl
使用Firecrawl提取市场、财务、收益、行业和公司指标。当用户询问市场研究、行业趋势、上市公司数据、财务比较、收益研究或结构化市场报告时使用。
officialresearchweb-scraping
firecrawl-website-design-clone
firecrawl
使用 Firecrawl 抓取证据,将任意网站的设计系统提取为可供智能体使用的 DESIGN.md 文件。当用户需要从网站获取颜色、字体、间距、组件、布局模式或品牌/UI 指导,以便 AI 智能体创建新网站、克隆外观或受该设计启发构建页面时使用。
officialdesignweb-scraping
firecrawl-knowledge-base
firecrawl
使用Firecrawl从网页内容构建知识库。适用于本地参考文档、RAG就绪文本块、微调数据集、文档镜像、主题语料库,或从网页来源整理的LLM就绪Markdown。
officialweb-scrapingresearch
firecrawl-lead-research
firecrawl
使用Firecrawl生成会前潜在客户情报简报。适用于用户在销售通话、合作会议、投资者对话或客户访谈前需要公司调研、人物调研、最新动态、谈话要点、痛点分析或外联准备时。
officialresearchweb-scraping