sentencepiece
作者: firecrawl
語言無關的分詞器,將文字視為原始 Unicode 處理。支援 BPE 與 Unigram 演算法。速度快(每秒 5 萬句)、輕量(記憶體僅 6MB)…
npx skills add https://github.com/firecrawl/ai-research-skills --skill sentencepieceSentencePiece - Language-Independent Tokenization
Unsupervised tokenizer that works on raw text without language-specific preprocessing.
When to use SentencePiece
Use SentencePiece when:
- Building multilingual models (no language-specific rules)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (6MB memory, 50k sentences/sec)
Performance:
- Speed: 50,000 sentences/sec
- Memory: ~6MB for loaded model
- Languages: All (language-independent)
Use alternatives instead:
- HuggingFace Tokenizers: Faster training, more flexibility
- tiktoken: OpenAI models (GPT-3.5/4)
- BERT WordPiece: English-centric tasks
Quick start
Installation
# Python
pip install sentencepiece
# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
Train model
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
# Python API
import sentencepiece as spm
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='m',
vocab_size=8000,
model_type='bpe'
)
Training time: ~1-2 minutes for 100MB corpus
Encode and decode
import sentencepiece as spm
# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')
# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces) # ['▁This', '▁is', '▁a', '▁test']
# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids) # [284, 47, 11, 1243]
# Decode
text = sp.decode(ids)
print(text) # "This is a test"
Language-independent design
Whitespace as symbol (▁)
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces) # ['▁Hello', '▁world']
# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded) # "Hello world"
Key principle: Treat text as raw Unicode, whitespace = ▁ (meta symbol)
Tokenization algorithms
BPE (Byte-Pair Encoding)
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='bpe_model',
vocab_size=16000,
model_type='bpe'
)
Used by: mBART
Unigram (default)
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='unigram_model',
vocab_size=8000,
model_type='unigram'
)
Used by: T5, ALBERT, XLNet
Training configuration
Essential parameters
spm.SentencePieceTrainer.train(
input='corpus.txt',
model_prefix='m',
vocab_size=32000,
model_type='unigram',
character_coverage=0.9995, # 1.0 for CJK
user_defined_symbols=['[SEP]', '[CLS]'],
unk_piece='<unk>',
num_threads=16
)
Character coverage
| Language Type | Coverage | Rationale |
|---|---|---|
| English | 0.9995 | Most common chars |
| CJK (Chinese) | 1.0 | All characters needed |
| Multilingual | 0.9995 | Balance |
Encoding options
Subword regularization
# Sample different tokenizations
for _ in range(3):
pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
print(pieces)
# Output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
Use case: Data augmentation for robustness.
Common patterns
T5-style training
spm.SentencePieceTrainer.train(
input='c4_corpus.txt',
model_prefix='t5',
vocab_size=32000,
model_type='unigram',
user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
unk_id=2,
eos_id=1,
pad_id=0
)
Integration with transformers
from transformers import T5Tokenizer
# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
Performance benchmarks
Training speed
| Corpus | BPE (16k) | Unigram (8k) |
|---|---|---|
| 100 MB | 1-2 min | 3-4 min |
| 1 GB | 10-15 min | 30-40 min |
Tokenization speed
- SentencePiece: 50,000 sentences/sec
- HF Tokenizers: 200,000 sentences/sec (4× faster)
Supported models
T5 family: t5-base, t5-large (32k vocab, Unigram)
ALBERT: albert-base-v2 (30k vocab, Unigram)
XLNet: xlnet-base-cased (32k vocab, Unigram)
mBART: facebook/mbart-large-50 (250k vocab, BPE)
References
- Training Guide - Detailed options, corpus preparation
- Algorithms - BPE vs Unigram, subword regularization
Resources
- GitHub: https://github.com/google/sentencepiece ⭐ 10,000+
- Paper: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
- Version: 0.2.0+
來自 firecrawl 的更多技能
oracle
firecrawl
使用 oracle CLI 的最佳實踐(提示與檔案捆綁、引擎、會話及檔案附加模式)。
official
firecrawl-monitor
firecrawl
偵測網站內容何時變更,並透過 Webhook 或電子郵件接收通知 — 無需 Cron 任務、爬蟲或比對腳本。當使用者想追蹤頁面變更、監控競爭對手定價、在新職缺或部落格文章出現時收到提醒、監控文件/更新紀錄/狀態頁面,或說出「監控」、「觀察」、「追蹤」、「當...時提醒我」、「當 X 變更時通知我」、「如果...請通知我」、「當...時寄信給我」或「當...時傳送 Webhook」時,請使用此技能。內建的 AI 判斷器會過濾格式、時間戳記及...
officialweb-scrapingresearch
firecrawl-deep-research
firecrawl
使用 Firecrawl 執行多來源深度研究。當使用者要求研究某個主題、比較不同觀點、產出具來源的簡報、調查技術或市場問題,或綜合多個來源的網路證據時使用。
officialresearchweb-scraping
firecrawl-research-papers
firecrawl
使用 Firecrawl 查找並綜合研究論文、白皮書、PDF、技術報告及學術來源。適用於用戶需要文獻回顧、論文摘要、研究現狀分析,或從 PDF 及學術/行業出版物中獲取有來源的綜合資訊時。
officialresearchweb-scraping
firecrawl-market-research
firecrawl
使用 Firecrawl 提取市場、財務、收益、行業及公司指標。適用於用戶查詢市場研究、行業趨勢、上市公司數據、財務比較、收益研究或結構化市場報告時使用。
officialresearchweb-scraping
firecrawl-website-design-clone
firecrawl
使用 Firecrawl 抓取證據,將任何網站的設計系統提取為可供代理程式使用的 DESIGN.md。當使用者需要從網站取得顏色、字型、間距、元件、版面配置模式或品牌/UI 指引,以便 AI 代理程式能建立新網站、複製外觀或根據該設計建構頁面時使用。
officialdesignweb-scraping
firecrawl-knowledge-base
firecrawl
使用 Firecrawl 從網頁內容建立知識庫。適用於本地參考文件、RAG 就緒區塊、微調資料集、文件鏡像、主題語料庫,或從網路來源整理而成的 LLM 就緒 Markdown。
officialweb-scrapingresearch
firecrawl-lead-research
firecrawl
使用 Firecrawl 生成會前潛在客戶情報簡報。適用於用戶在銷售通話、合作會議、投資人對話或客戶訪談前,需要進行公司研究、人物研究、最新新聞、談話要點、痛點分析或外展準備時。
officialresearchweb-scraping