sentencepiece
by firecrawl
Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory),…
npx skills add https://github.com/firecrawl/ai-research-skills --skill sentencepiece
SentencePiece - Language-Independent Tokenization
Unsupervised tokenizer that works on raw text without language-specific preprocessing.
When to use SentencePiece
Use SentencePiece when you:
- Build multilingual models (no language-specific rules)
- Work with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (about 6MB memory, about 50k sentences/sec)
Performance:
- Speed: about 50,000 sentences/sec
- Memory: ~6MB for loaded model
- Languages: All (language-independent)
Use alternatives instead:
- HuggingFace Tokenizers: Faster training, more flexibility
- tiktoken: OpenAI models (GPT-3.5/4)
- BERT WordPiece: English-centric tasks
Quick start
Installation
# Python
pip install sentencepiece
# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
Train model
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
# Python API
import sentencepiece as spm
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
Training time: ~1-2 minutes for 100MB corpus
Encode and decode
import sentencepiece as spm
# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')
# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces) # ['▁This', '▁is', '▁a', '▁test']
# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids) # e.g. [284, 47, 11, 1243]; actual IDs depend on the trained model
# Decode
text = sp.decode(ids)
print(text) # "This is a test"
Language-independent design
Whitespace as symbol (▁)
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces) # ['▁Hello', '▁world']
# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded) # "Hello world"
Key principle: treat text as a raw Unicode sequence and represent whitespace with the meta symbol ▁ (U+2581), so decoding can restore the original spacing.
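To make the round trip concrete, here is a minimal pure-Python sketch of the whitespace convention. It is a toy illustration, not SentencePiece's implementation: the helper names `escape_whitespace` and `unescape_whitespace` are hypothetical, and the sketch assumes a leading meta symbol is always added and that the input contains no literal ▁.

```python
# Toy illustration of the whitespace meta symbol; not SentencePiece's own code.
META = "\u2581"  # the "▁" character (LOWER ONE EIGHTH BLOCK)

def escape_whitespace(text: str) -> str:
    """Prefix the text with the meta symbol and replace every space with it."""
    return META + text.replace(" ", META)

def unescape_whitespace(escaped: str) -> str:
    """Invert escape_whitespace: restore spaces and drop the leading one."""
    restored = escaped.replace(META, " ")
    return restored[1:] if restored.startswith(" ") else restored

escaped = escape_whitespace("Hello world")
pieces = [META + "Hello", META + "world"]      # the pieces from the example above
detokenized = unescape_whitespace("".join(pieces))

print(escaped)       # ▁Hello▁world
print(detokenized)   # Hello world
```

Because the space is stored inside the pieces themselves, detokenization is plain concatenation followed by a character replacement, with no language-specific spacing rules.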
Tokenization algorithms
BPE (Byte-Pair Encoding)
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)
Used by: mBART
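As a rough illustration of what BPE training does, the sketch below learns merge rules from a tiny word-frequency table in pure Python. It is a toy version for intuition only, not SentencePiece's trainer; `learn_bpe_merges` and the corpus are hypothetical.

```python
from collections import Counter

def learn_bpe_merges(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a word -> frequency table (toy version)."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

# Hypothetical word counts, with the whitespace meta symbol already applied.
toy_corpus = {"▁low": 5, "▁lower": 2, "▁newest": 6, "▁widest": 3}
merges = learn_bpe_merges(toy_corpus, num_merges=3)
print(merges[:2])  # [('e', 's'), ('es', 't')]
```

Each merge adds one symbol to the vocabulary, so the number of merges is what a setting like `vocab_size` ultimately bounds.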
Unigram (default)
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)
Used by: T5, ALBERT, XLNet
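A Unigram model assigns each piece a probability and tokenizes by picking the segmentation with the highest total probability. The sketch below shows that search with a small Viterbi pass in pure Python. The function name and the piece probabilities are invented for illustration; a real model learns its probabilities during training.

```python
import math

def viterbi_segment(text: str, log_probs: dict[str, float]) -> list[str]:
    """Return the highest-probability segmentation of text under a unigram model."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of any split of text[:i]
    best_start = [0] * (n + 1)          # where the last piece of that split begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities, invented for this example.
toy_log_probs = {
    "▁token": math.log(0.05), "ization": math.log(0.02),
    "▁tok": math.log(0.01), "en": math.log(0.03),
    "iz": math.log(0.005), "ation": math.log(0.02),
}
text = "▁tokenization"
best_pieces = viterbi_segment(text, toy_log_probs)
print(best_pieces)  # ['▁token', 'ization']
```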
Training configuration
Essential parameters
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995, # 0.9995 for rich character sets (e.g. CJK); 1.0 for small ones
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
Character coverage
| Language type | Coverage | Rationale |
|---|---|---|
| Small character set (e.g. English) | 1.0 | Every character fits in the vocabulary |
| Rich character set (e.g. Chinese, Japanese) | 0.9995 | Very rare characters fall back to <unk> |
| Multilingual | 0.9995 | Balances coverage against vocabulary size |
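To see what a coverage value means in practice, the following sketch computes which characters would be kept for a given coverage on a synthetic corpus. It is a simplified model of the idea (keep the most frequent characters until the target share of occurrences is reached), not a description of SentencePiece's exact selection procedure; `chars_for_coverage` is a hypothetical helper.

```python
from collections import Counter

def chars_for_coverage(corpus: str, coverage: float) -> set[str]:
    """Most frequent characters whose occurrences reach `coverage` of the corpus."""
    counts = Counter(ch for ch in corpus if not ch.isspace())
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, count in counts.most_common():
        if covered >= coverage * total:
            break  # target share reached; remaining characters are dropped
        kept.add(ch)
        covered += count
    return kept

# Synthetic corpus: 10,000 characters, 3 distinct, one of them very rare.
corpus = "a" * 9990 + "b" * 9 + "é"
print(sorted(chars_for_coverage(corpus, 1.0)))     # ['a', 'b', 'é']
print(sorted(chars_for_coverage(corpus, 0.9995)))  # ['a', 'b']
```

With coverage below 1.0, the rarest characters are left out of the vocabulary and would be tokenized as the unknown piece.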
Encoding options
Subword regularization
# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)
# Output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
Use case: Data augmentation for robustness.
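The sampling idea can be sketched without the library: enumerate the candidate segmentations, then draw one with probability proportional to its unigram probability raised to the power `alpha`. This is a toy approximation that enumerates every segmentation of a short word; the piece probabilities and function names are hypothetical.

```python
import math
import random

def all_segmentations(text, vocab):
    """Yield every way to split text into pieces found in vocab."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                yield [piece] + rest

def sample_segmentation(text, probs, alpha, rng):
    """Draw one segmentation with weight P(segmentation) ** alpha."""
    candidates = list(all_segmentations(text, probs))
    weights = [math.prod(probs[p] for p in seg) ** alpha for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical piece probabilities, invented for this example.
toy_probs = {"▁token": 0.05, "ization": 0.02, "▁tok": 0.01,
             "en": 0.03, "iz": 0.005, "ation": 0.02}
text = "▁tokenization"
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [sample_segmentation(text, toy_probs, alpha=0.1, rng=rng) for _ in range(3)]
for pieces in samples:
    print(pieces)
```

A small `alpha` flattens the distribution and yields more varied segmentations, while a large `alpha` concentrates the samples on the most probable one.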
Common patterns
T5-style training
spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    pad_id=0,
    eos_id=1,
    unk_id=2,
    bos_id=-1  # disable BOS; its default ID (1) would otherwise collide with eos_id
)
Integration with transformers
from transformers import T5Tokenizer
# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
Performance benchmarks
Training speed
| Corpus | BPE (16k) | Unigram (8k) |
|---|---|---|
| 100 MB | 1-2 min | 3-4 min |
| 1 GB | 10-15 min | 30-40 min |
Tokenization speed
- SentencePiece: 50,000 sentences/sec
- HF Tokenizers: 200,000 sentences/sec (4× faster)
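Throughput figures like these vary with hardware, sentence length, and vocabulary size, so it can be worth measuring on your own data. The harness below is a minimal sketch: `sentences_per_second` is a hypothetical helper, and `str.split` stands in for a real tokenizer so the example runs without extra dependencies.

```python
import time

def sentences_per_second(tokenize, sentences, repeats=3):
    """Time tokenize() over all sentences and return the best observed rate."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            tokenize(sentence)
        best = min(best, time.perf_counter() - start)
    return len(sentences) / max(best, 1e-9)  # guard against a zero timing

# Stand-in workload: whitespace splitting instead of a real tokenizer.
sentences = ["This is a test sentence for benchmarking."] * 10_000
rate = sentences_per_second(str.split, sentences)
print(f"{rate:,.0f} sentences/sec")
```

To benchmark a loaded model, pass its encode method (for example `sp.encode` from the snippets above) in place of `str.split`.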
Supported models
T5 family: t5-base, t5-large (32k vocab, Unigram)
ALBERT: albert-base-v2 (30k vocab, Unigram)
XLNet: xlnet-base-cased (32k vocab, Unigram)
mBART: facebook/mbart-large-50 (250k vocab, BPE)
References
- Training Guide - Detailed options, corpus preparation
- Algorithms - BPE vs Unigram, subword regularization
Resources
- GitHub: https://github.com/google/sentencepiece ⭐ 10,000+
- Paper: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
- Version: 0.2.0+