profiling-tables

作者: astronomer

对数据库表进行全面的统计与质量分析,生成结构化的剖析结果。根据数据类型定制列级统计:数值列的最小值/最大值/百分位数、字符串的长度指标、时间戳的日期范围。执行基数分析以识别分类列与高基数列,并检测偏态分布。从五个维度评估数据质量:完整性(NULL率)、唯一性(重复值)、时效性(更新时间戳)……

npx skills add https://github.com/astronomer/agents --skill profiling-tables

Data Profile

Generate a comprehensive profile of a table that a new team member could use to understand the data.

Step 1: Basic Metadata

Query column metadata:

SELECT COLUMN_NAME, DATA_TYPE, COMMENT
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>' AND TABLE_NAME = '<table>'
ORDER BY ORDINAL_POSITION

If the table name isn't fully qualified, search INFORMATION_SCHEMA.TABLES to locate it first.

Step 2: Size and Shape

Run via run_sql:

SELECT
    COUNT(*) as total_rows,
    COUNT(*) / 1000000.0 as millions_of_rows
FROM <table>

Step 3: Column-Level Statistics

For each column, gather appropriate statistics based on data type:

Numeric Columns

SELECT
    MIN(column_name) as min_val,
    MAX(column_name) as max_val,
    AVG(column_name) as avg_val,
    STDDEV(column_name) as std_dev,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) as median,
    SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
    COUNT(DISTINCT column_name) as distinct_count
FROM <table>

String Columns

SELECT
    MIN(LEN(column_name)) as min_length,
    MAX(LEN(column_name)) as max_length,
    AVG(LEN(column_name)) as avg_length,
    SUM(CASE WHEN column_name IS NULL OR column_name = '' THEN 1 ELSE 0 END) as empty_count,
    COUNT(DISTINCT column_name) as distinct_count
FROM <table>

Date/Timestamp Columns

SELECT
    MIN(column_name) as earliest,
    MAX(column_name) as latest,
    DATEDIFF('day', MIN(column_name), MAX(column_name)) as date_range_days,
    SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count
FROM <table>

Step 4: Cardinality Analysis

For columns that look like categorical/dimension keys:

SELECT
    column_name,
    COUNT(*) as frequency,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM <table>
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20

This reveals:

  • High-cardinality columns (likely IDs or unique values)
  • Low-cardinality columns (likely categories or status fields)
  • Skewed distributions (one value dominates)

Step 5: Sample Data

Get representative rows:

SELECT *
FROM <table>
LIMIT 10

If the table is large and you want variety, sample from different time periods or categories.

Step 6: Data Quality Assessment

Summarize quality across dimensions:

Completeness

  • Which columns have NULLs? What percentage?
  • Are NULLs expected or problematic?

Uniqueness

  • Does the apparent primary key have duplicates?
  • Are there unexpected duplicate rows?

Freshness

  • When was data last updated? (MAX of timestamp columns)
  • Is the update frequency as expected?

Validity

  • Are there values outside expected ranges?
  • Are there invalid formats (dates, emails, etc.)?
  • Are there orphaned foreign keys?

Consistency

  • Do related columns make sense together?
  • Are there logical contradictions?

Step 7: Output Summary

Provide a structured profile:

Overview

2-3 sentences describing what this table contains, who uses it, and how fresh it is.

Schema

ColumnTypeNulls%DistinctDescription
...............

Key Statistics

  • Row count: X
  • Date range: Y to Z
  • Last updated: timestamp

Data Quality Score

  • Completeness: X/10
  • Uniqueness: X/10
  • Freshness: X/10
  • Overall: X/10

Potential Issues

List any data quality concerns discovered.

Recommended Queries

3-5 useful queries for common questions about this data.

来自 astronomer 的更多技能

airflow
astronomer
查询、管理和排查Apache Airflow的DAG、运行记录、任务及系统配置。支持30多种命令,涵盖DAG检查、运行管理、任务日志、配置查询及直接REST API访问。通过持久化配置管理多个Airflow实例;自动发现本地和Astro部署。同步(等待完成)或异步触发DAG运行,诊断故障,清除运行记录以重试,并通过重试/映射索引过滤访问任务日志。输出...
official
airflow-hitl
astronomer
在Airflow DAG中使用可延迟操作符实现人工审批关卡、表单输入和分支。四种操作符类型:用于批准/拒绝决策的ApprovalOperator、带表单的多选项选择HITLOperator、人工驱动的任务路由HITLBranchOperator,以及表单数据收集HITLEntryOperator。所有操作符均为可延迟设计,在通过Airflow UI的"必需操作"标签页或REST API等待人工响应时释放工作槽位。支持包括自定义在内的可选功能...
official
airflow-plugins
astronomer
构建嵌入FastAPI应用、自定义UI页面、React组件、中间件、宏和操作符链接的Airflow 3.1+插件,直接集成到Airflow UI中。使用…
official
analyzing-data
astronomer
查询数据仓库,利用缓存的模式和概念映射来回答业务问题。支持对重复问题类型进行模式查找和缓存,并通过记录结果来改进后续查询。包含概念到表的映射缓存,以及通过INFORMATION_SCHEMA或代码库grep进行表结构发现。提供run_sql()和run_sql_pandas()内核函数,返回Polars或Pandas DataFrame用于分析。提供CLI命令用于管理概念、模式和表缓存,以及...
official
annotating-task-lineage
astronomer
使用入口和出口为Airflow任务标注数据血缘。支持使用OpenLineage Dataset对象、Airflow Assets和Airflow Datasets定义跨数据库、数据仓库及云存储的输入输出。当运算符缺少内置OpenLineage提取器时作为备用方案;遵循四级优先级系统,其中自定义提取器和OpenLineage方法优先。包含针对Snowflake、BigQuery、S3和PostgreSQL的数据集命名辅助工具,以确保一致性...
official
authoring-dags
astronomer
创建Apache Airflow DAG的引导式工作流,集成验证与测试。采用六阶段结构化方法:发现环境与现有模式、规划DAG结构、遵循最佳实践实现、通过af CLI命令验证、经用户同意测试、迭代修复。用于发现(af config connections、af config providers、af dags list)和验证(af dags errors、af dags get、af dags explore)的CLI命令可提供DAG的即时反馈...
official
blueprint
astronomer
使用Pydantic验证定义可复用的Airflow任务组模板,并从YAML组合DAG。适用于创建blueprint模板、从YAML组合DAG等场景。
official
checking-freshness
astronomer
通过检查表时间戳和更新模式与陈旧度标尺对比,验证数据新鲜度。使用常见ETL命名模式(如_loaded_at、_updated_at、created_at等)识别时间戳列,并查询其最大值以确定数据时效。将数据分为四种新鲜度状态:新鲜(<4小时)、陈旧(4–24小时)、非常陈旧(>24小时)或未知(未找到时间戳)。提供SQL模板,用于检查最近几天的最后更新时间及行数变化趋势。
official