chdb-datastore
by clickhouse
DataStore is a lazy, ClickHouse-backed pandas replacement. Your existing pandas code works unchanged, but operations compile to optimized SQL and execute only when results are needed (e.g., print(), len(), iteration).
npx skills add https://github.com/clickhouse/agent-skills --skill chdb-datastore
chdb DataStore: It's Just Faster Pandas
The Key Insight
# Change this:
import pandas as pd
# To this:
import chdb.datastore as pd
# Everything else stays the same.
DataStore is a lazy, ClickHouse-backed pandas replacement. Your existing pandas code works unchanged — but operations compile to optimized SQL and execute only when results are needed (e.g., print(), len(), iteration).
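To make the idea of lazy execution concrete, here is a toy sketch built on the standard-library sqlite3 module. It is not how chdb implements DataStore: the `LazyTable` class, its `filter` method, and the `executed` log are all hypothetical names invented for this illustration. The only point it demonstrates is that building a query costs nothing until something like len() or iteration asks for rows.

```python
import sqlite3

executed = []  # every SQL string that actually reached the database


class LazyTable:
    """Toy lazy query builder (hypothetical, not the chdb API)."""

    def __init__(self, conn, table, conditions=()):
        self._conn = conn
        self._table = table
        self._conditions = tuple(conditions)  # (column, operator, value) triples

    def filter(self, column, op, value):
        # Only records the condition; no SQL is sent here.
        if op not in {">", "<", ">=", "<=", "=", "!="}:
            raise ValueError(f"unsupported operator: {op}")
        return LazyTable(self._conn, self._table,
                         self._conditions + ((column, op, value),))

    def to_sql(self):
        sql = f'SELECT * FROM "{self._table}"'
        if self._conditions:
            sql += " WHERE " + " AND ".join(
                f'"{col}" {op} ?' for col, op, _ in self._conditions)
        return sql

    def _run(self):
        sql = self.to_sql()
        executed.append(sql)
        params = [value for _, _, value in self._conditions]
        return self._conn.execute(sql, params).fetchall()

    def __len__(self):   # len() forces execution
        return len(self._run())

    def __iter__(self):  # so does iteration
        return iter(self._run())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ann", 31), ("Bob", 22), ("Cy", 40)])

adults = LazyTable(conn, "people").filter("age", ">", 25)
queries_before = len(executed)  # the filter alone ran nothing
row_count = len(adults)         # this call triggers the query
queries_after = len(executed)
print(adults.to_sql())
```

The same deferral is why inspecting the generated SQL before forcing a result is a cheap way to debug a pipeline.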
pip install chdb
Decision Tree: Pick the Right Approach
1. "I have a file/database and want to analyze it with pandas"
→ DataStore.from_file() / from_mysql() / from_s3() etc.
→ See references/connectors.md
2. "I need to join data from different sources"
→ Create DataStores from each source, use .join()
→ See examples/examples.md #3-5
3. "My pandas code is too slow"
→ import chdb.datastore as pd — change one line, keep the rest
4. "I need raw SQL queries"
→ Use the chdb-sql skill instead
Connect to Any Data Source — One Pattern
from datastore import DataStore
# Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)
ds = DataStore.from_file("sales.parquet")
# Database
ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
# Cloud storage
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
# URI shorthand — auto-detects source type
ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")
All 16+ sources and URI schemes → connectors.md
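Comparing the from_mysql() call with the URI shorthand above suggests how the URI's parts line up with the keyword arguments. The sketch below uses only urllib.parse to spell out that correspondence. The helper `mysql_uri_to_kwargs` is hypothetical, and the mapping is inferred from the two examples in this document rather than taken from chdb's parser, which may handle more cases.

```python
from urllib.parse import urlsplit


def mysql_uri_to_kwargs(uri):
    """Split a mysql:// URI into the keyword arguments shown for from_mysql().

    Hypothetical helper: the mapping is inferred from this document's examples.
    """
    parts = urlsplit(uri)
    if parts.scheme != "mysql":
        raise ValueError(f"expected a mysql:// URI, got {parts.scheme!r}")
    # The path is assumed to be /<database>/<table>.
    database, _, table = parts.path.lstrip("/").partition("/")
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    return {
        "host": host,
        "database": database,
        "table": table,
        "user": parts.username,
        "password": parts.password,
    }


kwargs = mysql_uri_to_kwargs("mysql://root:pass@db:3306/shop/orders")
print(kwargs)
```

Keeping credentials out of URIs in source code (for example by reading them from the environment) is usually safer than the inline form shown here.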
After Connecting — Full Pandas API
result = ds[ds["age"] > 25] # filter
result = ds[["name", "city"]] # select columns
result = ds.sort_values("revenue", ascending=False) # sort
result = ds.groupby("dept")["salary"].mean() # groupby
result = ds.assign(margin=lambda x: x["profit"] / x["revenue"]) # computed column
ds["name"].str.upper() # string accessor
ds["date"].dt.year # datetime accessor
result = ds1.join(ds2, on="id") # join
result = ds.head(10) # preview
print(ds.to_sql()) # see generated SQL
209 DataFrame methods supported. Full API → api-reference.md
Cross-Source Join — The Killer Feature
from datastore import DataStore
customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass")
orders = DataStore.from_file("orders.parquet")
result = (orders
    .join(customers, left_on="customer_id", right_on="id")
    .groupby("country")
    .agg({"amount": "sum", "rating": "mean"})
    .sort_values("amount", ascending=False))  # dict-style agg keeps the source column names
print(result)
More join examples → examples.md
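Conceptually, the pipeline above is a join followed by a grouped aggregation and a sort. The sketch below writes that shape out as plain SQL against an in-memory sqlite3 database with made-up rows. It is an illustration only: the tables, the values, and the aliases `total_amount` and `avg_rating` are invented here, and the SQL that chdb actually generates (visible via to_sql()) will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, country TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL, rating REAL);
INSERT INTO customers VALUES (1, 'DE'), (2, 'US'), (3, 'US');
INSERT INTO orders VALUES (1, 10.0, 4.0), (2, 25.0, 5.0),
                          (3, 5.0, 3.0), (1, 2.5, 2.0);
""")

# join -> group by country -> aggregate -> sort by the summed amount
rows = conn.execute("""
    SELECT c.country,
           SUM(o.amount) AS total_amount,
           AVG(o.rating) AS avg_rating
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    GROUP BY c.country
    ORDER BY total_amount DESC
""").fetchall()

for country, total_amount, avg_rating in rows:
    print(country, total_amount, avg_rating)
```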
Writing Data
source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
target = DataStore("file", path="summary.parquet", format="Parquet")
target.insert_into("category", "total", "count").select_from(
    source.groupby("category").select("category", "sum(amount) AS total", "count() AS count")
).execute()
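The insert_into(...).select_from(...) chain reads like SQL's INSERT ... SELECT. As a rough analogy only, the sketch below runs that statement shape in sqlite3 with invented rows. It writes to an in-memory table rather than a Parquet file and says nothing about what chdb emits for the DataStore call above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (category TEXT, amount REAL);
CREATE TABLE summary (category TEXT, total REAL, "count" INTEGER);
INSERT INTO orders VALUES ('books', 12.0), ('books', 8.0), ('toys', 5.0);
""")

# Aggregate the source and write the result into the target in one statement.
conn.execute("""
    INSERT INTO summary (category, total, "count")
    SELECT category, SUM(amount) AS total, COUNT(*) AS "count"
    FROM orders
    GROUP BY category
""")

summary = conn.execute("SELECT * FROM summary ORDER BY category").fetchall()
print(summary)
```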
Troubleshooting
| Problem | Fix |
|---|---|
| ImportError: No module named 'chdb' | pip install chdb |
| ImportError: cannot import 'DataStore' | Use from datastore import DataStore or from chdb.datastore import DataStore |
| Database connection timeout | Include port in host: host="db:3306" not host="db" |
| Join returns empty result | Check key types match (both int or both string); use .to_sql() to inspect |
| Unexpected results | Call ds.to_sql() to see the generated SQL and debug |
| Environment check | Run python scripts/verify_install.py (from skill directory) |
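The "join returns empty result" row is easiest to see with a tiny stand-in. The sketch below is plain Python, not DataStore: `inner_join` is a hypothetical hash-join helper over lists of dicts, used only to show that a string key such as "1" never equals an integer key 1, and that normalizing one side restores the matches. Whether a particular engine coerces types for you depends on the engine, so checking the key types first remains the safe habit.

```python
def inner_join(left, right, left_key, right_key):
    """Hypothetical hash join over lists of dicts (illustration only)."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


orders = [{"customer_id": "1", "amount": 10.0},   # keys stored as strings
          {"customer_id": "2", "amount": 25.0}]
customers = [{"id": 1, "country": "DE"},          # keys stored as integers
             {"id": 2, "country": "US"}]

mismatched = inner_join(orders, customers, "customer_id", "id")

# Cast the string keys to int so both sides use the same type.
normalized = [{**o, "customer_id": int(o["customer_id"])} for o in orders]
matched = inner_join(normalized, customers, "customer_id", "id")

print(len(mismatched), len(matched))
```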
References
- API Reference — Full DataStore method signatures
- Connectors — All 16+ data source connection methods
- Examples — 10+ runnable examples with expected output
- Verify Install — Environment verification script
- Official Docs
Note: This skill teaches how to use chdb DataStore. For raw SQL queries, use the chdb-sql skill. For contributing to chdb source code, see CLAUDE.md in the project root.