huggingface-tokenizers

Author: firecrawl

A fast tokenizer optimized for research and production. The Rust-based implementation tokenizes 1 GB in under 20 seconds. Supports the BPE, WordPiece, and Unigram algorithms.…

npx skills add https://github.com/firecrawl/ai-research-skills --skill huggingface-tokenizers

HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

When to use HuggingFace Tokenizers

Use HuggingFace Tokenizers when:

  • Need extremely fast tokenization (<20s per GB of text)
  • Training custom tokenizers from scratch
  • Want alignment tracking (token → original text position)
  • Building production NLP pipelines
  • Need to tokenize large corpora efficiently

Performance:

  • Speed: <20 seconds to tokenize 1GB on CPU
  • Implementation: Rust core with Python/Node.js bindings
  • Efficiency: 10-100× faster than pure Python implementations

Use alternatives instead:

  • SentencePiece: Language-independent, used by T5/ALBERT
  • tiktoken: OpenAI's BPE tokenizer for GPT models
  • transformers AutoTokenizer: When you only need to load a pretrained tokenizer (its fast tokenizers use this library internally)

Quick start

Installation

# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers

Load pretrained tokenizer

from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text
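output = tokenizer.encode("Hello, how are you?")
# The pretrained BERT tokenizer adds [CLS] and [SEP] during post-processing
print(output.tokens)  # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(output.ids)     # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

# Decode back (special tokens are skipped by default)
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"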

Train custom BPE tokenizer

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")

Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB

Batch encoding with padding

# Enable padding (look up the [PAD] ID in the vocabulary instead of hard-coding it)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)

for encoding in encodings:
    print(encoding.ids)
# With bert-base-uncased, where [PAD] has ID 0:
# [101, 7592, 2088, 102, 0, 0, 0]
# [101, 2023, 2003, 1037, 2936, 6251, 102]

Tokenization algorithms

BPE (Byte-Pair Encoding)

How it works:

  1. Start with character-level vocabulary
  2. Find the most frequent pair of adjacent symbols
  3. Merge into new token, add to vocabulary
  4. Repeat until vocabulary size reached
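
The steps above can be sketched in plain Python. This is a toy illustration of the merge loop, not the library's Rust implementation: the function name `train_bpe`, the word-count input format, and the tie-breaking rule (first pair seen wins) are all assumptions made for the example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping (toy sketch)."""
    # Step 1: every word starts as a tuple of single characters.
    corpus = {tuple(word): count for word, count in words.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: the first pair counted wins
        merges.append(best)
        # Step 3: replace the chosen pair with one merged symbol in every word.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    # Step 4 is the loop itself: here it stops after a fixed number of merges.
    return merges, corpus

merges, corpus = train_bpe(
    {"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3
)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Because later merges build on earlier ones ("es" must exist before "est"), the order of the merge list matters at encoding time.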

Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
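from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Byte-level BPE can represent any input, so no unk_token is needed
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()  # maps byte-level symbols back to text on decode

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2,
    initial_alphabet=ByteLevel.alphabet(),  # keep all 256 byte symbols in the vocabulary
)

tokenizer.train(files=["data.txt"], trainer=trainer)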

Advantages:

  • Handles OOV words well (breaks into subwords)
  • Flexible vocabulary size
  • Good for morphologically rich languages

Trade-offs:

  • Tokenization depends on merge order
  • May split common words unexpectedly

WordPiece

How it works:

  1. Start with character vocabulary
  2. Score merge pairs: frequency(pair) / (frequency(first) × frequency(second))
  3. Merge highest scoring pair
  4. Repeat until vocabulary size reached
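
To make the scoring rule in step 2 concrete, here is a small sketch in plain Python. The function name `wordpiece_scores` and the toy word counts are assumptions for illustration; the library's trainer is implemented differently.

```python
from collections import Counter

def wordpiece_scores(words):
    """Score pairs as freq(pair) / (freq(first) * freq(second)) (toy sketch)."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for symbols, count in words.items():
        for symbol in symbols:
            symbol_freq[symbol] += count
        for first, second in zip(symbols, symbols[1:]):
            pair_freq[(first, second)] += count
    scores = {
        pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }
    return scores, pair_freq

# Words pre-split into characters, with "##" marking word-internal pieces.
words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
scores, pair_freq = wordpiece_scores(words)
most_frequent = max(pair_freq, key=pair_freq.get)
best_scored = max(scores, key=scores.get)
print(most_frequent)  # ('##u', '##g')
print(best_scored)    # ('##g', '##s')
```

The most frequent pair and the best-scored pair differ here: "##u" appears in many contexts, so pairs containing it are penalized, while "##s" only ever follows "##g".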

Used by: BERT, DistilBERT, MobileBERT

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)

Advantages:

  • Favors pairs whose parts rarely occur apart, rather than pairs that are merely frequent
  • Used by BERT and its variants

Trade-offs:

  • A whole word becomes [UNK] if any part of it has no matching subword
  • Stores only the vocabulary, not merge rules; encoding uses greedy longest-match-first lookup

Unigram

How it works:

  1. Start with large vocabulary (all substrings)
  2. Compute loss for corpus with current vocabulary
  3. Remove tokens with minimal impact on loss
  4. Repeat until vocabulary size reached
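
Steps 2 and 3 can be illustrated with a simplified sketch. It scores a vocabulary by the negative log-likelihood of each word's single best segmentation, found with Viterbi search. This is an assumption made for brevity: real Unigram trainers typically work with expected counts over all segmentations, and the names `viterbi_segment` and `corpus_loss` and the toy probabilities are hypothetical.

```python
import math

def viterbi_segment(text, log_probs):
    """Return (tokens, log-likelihood) of the most likely segmentation of `text`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last token in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {text!r} with this vocabulary")
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

def corpus_loss(word_counts, log_probs):
    """Negative log-likelihood of the corpus under best segmentations."""
    return -sum(
        count * viterbi_segment(word, log_probs)[1]
        for word, count in word_counts.items()
    )

probs = {"h": 0.1, "u": 0.1, "g": 0.1, "hu": 0.2, "ug": 0.3, "hug": 0.2}
log_probs = {token: math.log(p) for token, p in probs.items()}
full_tokens, _ = viterbi_segment("hug", log_probs)

# Prune "hug" and measure how much the loss grows.
pruned = {token: lp for token, lp in log_probs.items() if token != "hug"}
pruned_tokens, _ = viterbi_segment("hug", pruned)
loss_increase = corpus_loss({"hug": 10}, pruned) - corpus_loss({"hug": 10}, log_probs)

print(full_tokens)    # ['hug']
print(pruned_tokens)  # ['h', 'ug']
```

A token whose removal barely changes the loss is a good pruning candidate; here removing "hug" raises the loss noticeably, so a trainer following step 3 would tend to keep it.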

Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)

tokenizer.train(files=["data.txt"], trainer=trainer)

Advantages:

  • Probabilistic (finds most likely tokenization)
  • Works well for languages without word boundaries
  • Handles diverse linguistic contexts

Trade-offs:

  • Computationally expensive to train
  • More hyperparameters to tune

Tokenization pipeline

Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing

Normalization

Clean and standardize text:

from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),           # Unicode normalization (decompose)
    Lowercase(),     # Convert to lowercase
    StripAccents()   # Remove accents
])

# Input: "Héllo WORLD"
# After normalization: "hello world"

Common normalizers:

  • NFD, NFC, NFKD, NFKC - Unicode normalization forms
  • Lowercase() - Convert to lowercase
  • StripAccents() - Remove accents (é → e)
  • Strip() - Remove leading and trailing whitespace
  • Replace(pattern, content) - Replace a string or regex pattern
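
To show why NFD comes before StripAccents in the sequence above, here is a standard-library approximation using `unicodedata`. The helper names `strip_marks` and `normalize` are hypothetical, and treating every category-M character as an accent is an assumption about how closely this mirrors the library's behavior.

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks (Unicode general categories starting with 'M')."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )

def normalize(text):
    """Approximate Sequence([NFD(), Lowercase(), StripAccents()])."""
    decomposed = unicodedata.normalize("NFD", text)  # "é" -> "e" + combining accent
    return strip_marks(decomposed.lower())

raw = "H\u00e9llo WORLD"  # "Héllo WORLD" with a precomposed é
print(normalize(raw))     # hello world
print(strip_marks(raw))   # Héllo WORLD (nothing to strip without NFD)
```

Without the decomposition step, the precomposed "é" is a single letter with no separate combining mark, so accent stripping leaves it untouched.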

Pre-tokenization

Split text into word-like units:

from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])

# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]

Common pre-tokenizers:

  • Whitespace() - Split into words and punctuation runs using the regex \w+|[^\w\s]+ (use WhitespaceSplit() to split on whitespace only)
  • ByteLevel() - GPT-2 style byte-level splitting
  • Punctuation() - Isolate punctuation
  • Digits(individual_digits=True) - Split digits individually
  • Metaspace() - Replace spaces with ▁ (SentencePiece style)
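
As a rough standard-library illustration of what a pre-tokenizer produces, the sketch below applies the `\w+|[^\w\s]+` pattern with Python's `re` module and keeps character offsets alongside each piece. The function name `pre_tokenize` is hypothetical; in the library, `pre_tokenize_str` on a pre-tokenizer returns pieces with offsets in a similar shape.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Return (piece, (start, end)) pairs for each word or punctuation run."""
    return [(m.group(), (m.start(), m.end())) for m in WORD_OR_PUNCT.finditer(text)]

pieces = pre_tokenize("Hello, world!")
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

Keeping offsets at this stage is what later makes alignment tracking possible: the model only ever splits inside these pieces.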

Post-processing

Add special tokens for model input:

from tokenizers.processors import TemplateProcessing

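# BERT-style: [CLS] sentence [SEP]
# The IDs below must match the tokenizer's vocabulary; check with tokenizer.token_to_id()
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",  # ":1" assigns type ID 1 to the second segment
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)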

Common patterns:

# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
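
The template strings above can be read as a simple substitution. The following toy expander is a sketch of that idea only: `apply_template` is a hypothetical helper, it works on ID lists, and it ignores the type-ID suffixes (such as `$B:1`) that the real `TemplateProcessing` supports.

```python
def apply_template(template, special_ids, a_ids, b_ids=None):
    """Expand a template like "<s> $A </s>" into a flat list of token IDs."""
    output = []
    for piece in template.split():
        if piece == "$A":
            output.extend(a_ids)
        elif piece == "$B":
            if b_ids is None:
                raise ValueError("template uses $B but no second sequence was given")
            output.extend(b_ids)
        else:
            output.append(special_ids[piece])  # a special token such as "<s>"
    return output

roberta_ids = {"<s>": 0, "</s>": 2}
single = apply_template("<s> $A </s>", roberta_ids, [10, 11])
pair = apply_template("<s> $A </s> </s> $B </s>", roberta_ids, [10, 11], [20])
print(single)  # [0, 10, 11, 2]
print(pair)    # [0, 10, 11, 2, 2, 20, 2]
```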

Alignment tracking

Track token positions in original text:

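text = "Hello, world!"
# Skip special tokens so that every token maps to a span of the input
output = tokenizer.encode(text, add_special_tokens=False)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")

# Output (bert-base-uncased):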
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'

Use cases:

  • Named entity recognition (map predictions back to text)
  • Question answering (extract answer spans)
  • Token classification (align labels to original positions)
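
For the use cases above, the usual pattern is to map a character span (for example, an entity annotation) to the tokens that overlap it. A minimal sketch, assuming offsets shaped like `output.offsets` and special tokens reported as empty `(0, 0)` spans; the helper name `tokens_in_span` is hypothetical.

```python
def tokens_in_span(offsets, span_start, span_end):
    """Return indices of tokens overlapping the character span [span_start, span_end)."""
    return [
        index
        for index, (start, end) in enumerate(offsets)
        # Skip empty spans (special tokens), keep tokens that overlap the span.
        if end > start and start < span_end and end > span_start
    ]

# Offsets for "[CLS] hello , world ! [SEP]" over the text "Hello, world!"
offsets = [(0, 0), (0, 5), (5, 6), (7, 12), (12, 13), (0, 0)]
world_tokens = tokens_in_span(offsets, 7, 12)  # the characters of "world"
tail_tokens = tokens_in_span(offsets, 5, 13)   # ", world!"
print(world_tokens)  # [3]
print(tail_tokens)   # [2, 3, 4]
```

The same lookup run in reverse (token index to `offsets[index]`) recovers the original text span for a model prediction.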

Integration with transformers

Load with AutoTokenizer

from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer
print(tokenizer.is_fast)  # True

# Access underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>

Convert custom tokenizer to transformers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

Common patterns

Train from iterator (large datasets)

from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)

Performance: Processes 1GB in ~10-20 minutes

Enable truncation and padding

# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)

# Encode with both: longer inputs are truncated, shorter ones are padded to 512
output = tokenizer.encode("This is a sentence that will be padded to the fixed length")
print(len(output.ids))  # 512
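
What truncation and padding do to a batch of ID lists can be sketched in plain Python. This is a simplified illustration, not the library's logic: `pad_and_truncate` is a hypothetical helper, it pads on the right only, and it cuts IDs naively, whereas the library truncates before post-processing so special tokens are preserved.

```python
def pad_and_truncate(batch, max_length, pad_id, fixed_length=None):
    """Truncate each ID list to max_length, then right-pad to a common length."""
    truncated = [ids[:max_length] for ids in batch]
    # Pad to a fixed length if given, otherwise to the longest sequence in the batch.
    target = fixed_length if fixed_length is not None else max(len(ids) for ids in truncated)
    padded = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return padded, mask

batch = [[101, 7592, 2088, 102], [101, 2023, 2003, 1037, 2936, 6251, 102]]
padded, mask = pad_and_truncate(batch, max_length=512, pad_id=0)
print(padded[0])  # [101, 7592, 2088, 102, 0, 0, 0]
print(mask[0])    # [1, 1, 1, 1, 0, 0, 0]
```

Padding to the batch maximum wastes fewer positions than a fixed length of 512, at the cost of batches with varying shapes.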

Multi-processing

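from tokenizers import Tokenizer
from multiprocessing import Pool

# Load tokenizer (each worker process gets its own copy)
# Note: encode_batch already uses multiple threads within a single process
tokenizer = Tokenizer.from_file("tokenizer.json")

def encode_batch(texts):
    # Return plain ID lists, which are cheap to send back to the parent process
    return [encoding.ids for encoding in tokenizer.encode_batch(texts)]

if __name__ == "__main__":
    # Placeholder corpus: replace with your own list of strings
    corpus = ["Hello world", "This is a longer sentence"] * 5000

    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode chunks in parallel
    with Pool(8) as pool:
        results = pool.map(encode_batch, chunks)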

Speedup: 5-8× with 8 cores

Performance benchmarks

Training speed

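| Corpus size | BPE (30k vocab) | WordPiece (30k vocab) | Unigram (8k vocab) |
|-------------|-----------------|-----------------------|--------------------|
| 10 MB       | 15 sec          | 18 sec                | 25 sec             |
| 100 MB      | 1.5 min         | 2 min                 | 4 min              |
| 1 GB        | 15 min          | 20 min                | 40 min             |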

Hardware: 16-core CPU, tested on English Wikipedia

Tokenization speed

| Implementation | 1 GB corpus | Throughput |
|----------------|-------------|------------|
| Pure Python    | ~20 minutes | ~50 MB/min |
| HF Tokenizers  | ~15 seconds | ~4 GB/min  |
| Speedup        | 80×         | 80×        |

Test: English text, average sentence length 20 words

Memory usage

| Task                  | Memory  |
|-----------------------|---------|
| Load tokenizer        | ~10 MB  |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences   | ~500 MB |

Supported models

Pre-trained tokenizers available via from_pretrained():

BERT family:

  • bert-base-uncased, bert-large-cased
  • distilbert-base-uncased
  • roberta-base, roberta-large

GPT family:

  • gpt2, gpt2-medium, gpt2-large
  • distilgpt2

T5 family:

  • t5-small, t5-base, t5-large
  • google/flan-t5-xxl

Other:

  • facebook/bart-base, facebook/mbart-large-cc25
  • albert-base-v2, albert-xlarge-v2
  • xlm-roberta-base, xlm-roberta-large

Browse all: https://huggingface.co/models?library=tokenizers
