chdb-datastore
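Author: clickhouse
DataStore is a lazy, ClickHouse-backed pandas replacement. Your existing pandas code runs unchanged, but operations are compiled to optimized SQL and executed only when results are needed (e.g., print(), len(), iteration).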
npx skills add https://github.com/clickhouse/agent-skills --skill chdb-datastore
chdb DataStore: It's Just Faster Pandas
The Key Insight
# Change this:
import pandas as pd
# To this:
import chdb.datastore as pd
# Everything else stays the same.
DataStore is a lazy, ClickHouse-backed pandas replacement. Your existing pandas code works unchanged — but operations compile to optimized SQL and execute only when results are needed (e.g., print(), len(), iteration).
pip install chdb
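To make the lazy model described above concrete, here is a minimal sketch of the general idea: chained calls only record steps, and SQL is built and run when the result is iterated. This is not DataStore's implementation. `LazyQuery`, its methods, and the `people` table are hypothetical names for illustration, and the stdlib `sqlite3` module stands in for the ClickHouse engine.

```python
import sqlite3


class LazyQuery:
    """Hypothetical stand-in for a lazy frame: records steps, runs SQL on demand."""

    def __init__(self, conn, table, filters=(), order_by=None, log=None):
        self.conn = conn
        self.table = table
        self.filters = tuple(filters)
        self.order_by = order_by
        # Shared list of every SQL statement that was actually executed.
        self.log = log if log is not None else []

    def where(self, condition):
        # Only records the filter; nothing is executed here.
        return LazyQuery(self.conn, self.table,
                         self.filters + (condition,), self.order_by, self.log)

    def sort(self, column, ascending=True):
        # Only records the ordering; nothing is executed here.
        direction = "ASC" if ascending else "DESC"
        return LazyQuery(self.conn, self.table,
                         self.filters, f"{column} {direction}", self.log)

    def to_sql(self):
        # Compile the recorded steps into a single SQL statement.
        sql = f"SELECT * FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.order_by:
            sql += " ORDER BY " + self.order_by
        return sql

    def __iter__(self):
        # Iteration is the trigger: compile, execute, and log the statement.
        sql = self.to_sql()
        self.log.append(sql)
        return iter(self.conn.execute(sql).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ann", 31), ("Bob", 22), ("Cy", 40)])

query = LazyQuery(conn, "people").where("age > 25").sort("age", ascending=False)
sql_runs_before = len(query.log)  # still 0: building the chain ran no SQL
rows = list(query)                # iteration executes the compiled statement once
```

The design point is that every chained call is cheap, so the engine sees the whole pipeline at once and can plan it as one query instead of materializing each intermediate step.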
Decision Tree: Pick the Right Approach
1. "I have a file/database and want to analyze it with pandas"
→ DataStore.from_file() / from_mysql() / from_s3() etc.
→ See references/connectors.md
2. "I need to join data from different sources"
→ Create DataStores from each source, use .join()
→ See examples/examples.md #3-5
3. "My pandas code is too slow"
→ import chdb.datastore as pd — change one line, keep the rest
4. "I need raw SQL queries"
→ Use the chdb-sql skill instead
Connect to Any Data Source — One Pattern
from datastore import DataStore
# Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)
ds = DataStore.from_file("sales.parquet")
# Database
ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
# Cloud storage
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
# URI shorthand — auto-detects source type
ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")
All 16+ sources and URI schemes → connectors.md
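As a reading aid for the URI shorthand, the sketch below splits the MySQL URI from the example into the same keyword arguments passed to from_mysql() above. It uses only the standard library's `urllib.parse`. The helper `mysql_uri_to_kwargs` is a hypothetical name, and the path layout `/database/table` is inferred from the example URI rather than taken from chdb's own parser, which may behave differently.

```python
from urllib.parse import urlsplit


def mysql_uri_to_kwargs(uri):
    """Illustrative only: map a mysql:// URI onto from_mysql-style keyword arguments."""
    parts = urlsplit(uri)
    if parts.scheme != "mysql":
        raise ValueError(f"expected a mysql:// URI, got scheme {parts.scheme!r}")

    # Assumed layout: the path is "/<database>/<table>".
    database, _, table = parts.path.lstrip("/").partition("/")

    # Keep the port inside the host string, matching host="db:3306" in the examples.
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"

    return {
        "host": host,
        "database": database,
        "table": table,
        "user": parts.username,
        "password": parts.password,
    }


kwargs = mysql_uri_to_kwargs("mysql://root:pass@db:3306/shop/orders")
```

Under these assumptions, `kwargs` carries the same values as the explicit from_mysql() call: host "db:3306", database "shop", table "orders", user "root", password "pass".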
After Connecting — Full Pandas API
result = ds[ds["age"] > 25] # filter
result = ds[["name", "city"]] # select columns
result = ds.sort_values("revenue", ascending=False) # sort
result = ds.groupby("dept")["salary"].mean() # groupby
result = ds.assign(margin=lambda x: x["profit"] / x["revenue"]) # computed column
ds["name"].str.upper() # string accessor
ds["date"].dt.year # datetime accessor
result = ds1.join(ds2, on="id") # join
result = ds.head(10) # preview
print(ds.to_sql()) # see generated SQL
209 DataFrame methods supported. Full API → api-reference.md
Cross-Source Join — The Killer Feature
from datastore import DataStore
customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass")
orders = DataStore.from_file("orders.parquet")
result = (orders
    .join(customers, left_on="customer_id", right_on="id")
    .groupby("country")
    .agg({"amount": "sum", "rating": "mean"})
    .sort_values("amount", ascending=False))  # dict-style agg keeps source column names, as in pandas
print(result)
More join examples → examples.md
Writing Data
source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
target = DataStore("file", path="summary.parquet", format="Parquet")
target.insert_into("category", "total", "count").select_from(
source.groupby("category").select("category", "sum(amount) AS total", "count() AS count")
).execute()
Troubleshooting
| Problem | Fix |
|---|---|
| ImportError: No module named 'chdb' | pip install chdb |
| ImportError: cannot import 'DataStore' | Use from datastore import DataStore or from chdb.datastore import DataStore |
| Database connection timeout | Include port in host: host="db:3306" not host="db" |
| Join returns empty result | Check key types match (both int or both string); use .to_sql() to inspect |
| Unexpected results | Call ds.to_sql() to see the generated SQL and debug |
| Environment check | Run python scripts/verify_install.py (from skill directory) |
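The "Join returns empty result" row is easiest to see with a toy example. The plain-Python sketch below shows the general pitfall: keys that look identical but differ in type (string "1" versus integer 1) never compare equal, so an inner join matches nothing until one side is cast. The `inner_join` helper and the sample rows are hypothetical, and the sketch does not reproduce how chdb or ClickHouse compares join keys.

```python
def inner_join(left, right, left_key, right_key):
    """Tiny hash join over lists of dicts, for illustration only."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    # Emit one merged row per matching (left, right) pair.
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# customer_id was loaded as text, id as an integer.
orders = [{"customer_id": "1", "amount": 20}, {"customer_id": "2", "amount": 35}]
customers = [{"id": 1, "country": "DE"}, {"id": 2, "country": "FR"}]

# "1" != 1, so no pair matches and the join is empty.
mismatched = inner_join(orders, customers, "customer_id", "id")

# Casting one side so both keys are integers restores the matches.
fixed_orders = [{**o, "customer_id": int(o["customer_id"])} for o in orders]
matched = inner_join(fixed_orders, customers, "customer_id", "id")
```

With a DataStore, the equivalent check is to inspect the column types on both sides and the output of .to_sql() before changing the join itself.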
References
- API Reference — Full DataStore method signatures
- Connectors — All 16+ data source connection methods
- Examples — 10+ runnable examples with expected output
- Verify Install — Environment verification script
- Official Docs
Note: This skill teaches how to use chdb DataStore. For raw SQL queries, use the chdb-sql skill. For contributing to chdb source code, see CLAUDE.md in the project root.