tracing-upstream-lineage

Author: astronomer

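Trace upstream data lineage to identify the sources, DAGs, and dependencies that feed a table or column. Supports three target types: tables, columns, and DAGs; finds the producing pipeline through Airflow DAG source code and task inspection. Handles SQL sources (FROM clauses), external systems (S3, Postgres, Salesforce, HTTP APIs), and file-based sources; recursively traces the upstream chain. Includes column-level tracing through direct mappings, transformations, and aggregations in DAG code...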

npx skills add https://github.com/astronomer/agents --skill tracing-upstream-lineage

Upstream Lineage: Sources

Trace the origins of data - answer "Where does this data come from?"

Lineage Investigation

Step 1: Identify the Target Type

Determine what we're tracing:

Table: Trace what populates this table
Column: Trace where this specific column comes from
DAG: Trace what data sources this DAG reads from

Step 2: Find the Producing DAG

Tables are typically populated by Airflow DAGs. Find the connection:

Search DAGs by name: Use af dags list and look for DAG names matching the table name
- load_customers -> customers table
- etl_daily_orders -> orders table
Explore DAG source code: Use af dags source <dag_id> to read the DAG definition
- Look for INSERT, MERGE, CREATE TABLE statements
- Find the target table in the code
Check DAG tasks: Use af tasks list <dag_id> to see what operations the DAG performs
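The search above can be partly automated once the DAG source text is in hand (for example, saved from af dags source <dag_id>). The following is a minimal sketch, not part of the skill itself: the helper names name_candidates, find_write_targets, and dags_writing_to are hypothetical, and the regex only recognizes plain INSERT INTO, MERGE INTO, and CREATE TABLE statements, so dynamically built SQL or templated table names will be missed.

```python
import re

# Matches the table named after INSERT INTO, MERGE INTO, or CREATE TABLE.
WRITE_PATTERN = re.compile(
    r"\b(?:INSERT\s+INTO|MERGE\s+INTO|"
    r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE(?:\s+IF\s+NOT\s+EXISTS)?)\s+([\w.]+)",
    re.IGNORECASE,
)


def name_candidates(table: str, dag_ids: list[str]) -> list[str]:
    """Return DAG ids whose name contains the table's base name."""
    base = table.lower().split(".")[-1]
    return sorted(d for d in dag_ids if base in d.lower())


def find_write_targets(dag_source: str) -> set[str]:
    """Return table names that the DAG source text appears to write to."""
    return {m.group(1).lower() for m in WRITE_PATTERN.finditer(dag_source)}


def _same_table(a: str, b: str) -> bool:
    """Compare exactly when both names carry a schema, else by base name."""
    a, b = a.lower(), b.lower()
    if "." in a and "." in b:
        return a == b
    return a.split(".")[-1] == b.split(".")[-1]


def dags_writing_to(table: str, dag_sources: dict[str, str]) -> list[str]:
    """Return ids of DAGs whose source contains a write to `table`."""
    return sorted(
        dag_id
        for dag_id, source in dag_sources.items()
        if any(_same_table(t, table) for t in find_write_targets(source))
    )
```

Name matching is only a heuristic for shortlisting DAGs; a match on the write statement in the source is the stronger evidence.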

On Astro

If you're running on Astro, the Lineage tab in the Astro UI provides visual lineage exploration across DAGs and datasets. Use it to quickly trace upstream dependencies without manually searching DAG source code.

On OSS Airflow

Trace lineage from DAG source code and task logs; there is no built-in cross-DAG lineage UI.

Step 3: Trace Data Sources

From the DAG code, identify source tables and systems:

SQL Sources (look for FROM clauses):

-- In DAG code:
SELECT * FROM source_schema.source_table  -- <- This is an upstream source
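To pull FROM-clause sources out of a SQL string found in DAG code, a regex pass is often enough for a first look. This is a rough sketch under stated assumptions: find_sql_sources is a hypothetical helper, it treats JOIN targets as sources too, and it is not a SQL parser, so CTE names, DELETE FROM targets, and expressions such as EXTRACT(day FROM ts) can show up as false positives.

```python
import re

# A table reference is taken to be the identifier right after FROM or JOIN.
# Subqueries start with "(" and are skipped; their inner FROM is still matched.
SOURCE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def find_sql_sources(sql: str) -> list[str]:
    """Return distinct table names after FROM/JOIN, in order of appearance."""
    seen: list[str] = []
    for match in SOURCE_PATTERN.finditer(sql):
        name = match.group(1).lower()
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    query = (
        "SELECT o.id, c.name FROM raw.orders o "
        "JOIN dim.customers c ON o.customer_id = c.id"
    )
    print(find_sql_sources(query))  # ['raw.orders', 'dim.customers']
```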

External Sources (look for connection references):

S3Operator -> S3 bucket source
PostgresOperator -> Postgres database source
SalesforceOperator -> Salesforce API source
HttpOperator -> REST API source
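The operator-to-system mapping above can be applied mechanically to DAG source text. The sketch below is illustrative only: SYSTEM_KEYWORDS mirrors the four entries in the list and would need extending for the providers actually installed, find_external_sources is a hypothetical helper, and matching class names by keyword will not see operators hidden behind aliases or custom wrappers.

```python
import re

# Keyword found in a class name -> upstream system, taken from the list above.
SYSTEM_KEYWORDS = {
    "S3": "S3 bucket",
    "Postgres": "Postgres database",
    "Salesforce": "Salesforce API",
    "Http": "REST API",
}

# Class-like identifiers ending in Operator, Hook, or Sensor.
CLASS_PATTERN = re.compile(r"\b([A-Z]\w*(?:Operator|Hook|Sensor))\b")


def find_external_sources(dag_source: str) -> dict[str, str]:
    """Map each matching class name in the source to its upstream system."""
    found: dict[str, str] = {}
    for class_name in CLASS_PATTERN.findall(dag_source):
        for keyword, system in SYSTEM_KEYWORDS.items():
            if keyword.lower() in class_name.lower():
                found[class_name] = system
    return found
```

A hit only says which kind of system the DAG talks to; whether the task reads from or writes to that system still has to be read from the task's arguments.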

File Sources:

CSV/Parquet files in object storage
SFTP drops
Local file paths

Step 4: Build the Lineage Chain

Recursively trace each source:

TARGET: analytics.orders_daily
    ^
    +-- DAG: etl_daily_orders
            ^
            +-- SOURCE: raw.orders (table)
            |       ^
            |       +-- DAG: ingest_orders
            |               ^
            |               +-- SOURCE: Salesforce API (external)
            |
            +-- SOURCE: dim.customers (table)
                    ^
                    +-- DAG: load_customers
                            ^
                            +-- SOURCE: PostgreSQL (external DB)
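The recursion behind this chain can be sketched in a few lines. The example assumes the findings from Steps 2 and 3 have already been collected into a dictionary mapping each table to its producing DAG and that DAG's sources; the PRODUCERS data and the trace_upstream helper are hypothetical and simply restate the diagram above.

```python
# Hypothetical catalog: table -> (producing DAG, upstream sources of that DAG).
PRODUCERS = {
    "analytics.orders_daily": ("etl_daily_orders", ["raw.orders", "dim.customers"]),
    "raw.orders": ("ingest_orders", ["Salesforce API"]),
    "dim.customers": ("load_customers", ["PostgreSQL"]),
}


def trace_upstream(target, producers, depth=0, ancestors=frozenset()):
    """Return indented lines describing everything upstream of `target`."""
    indent = "    " * depth
    if target in ancestors:
        # Guard against DAGs that read from and write to each other.
        return [f"{indent}{target} (cycle, already traced)"]
    if target not in producers:
        # No known producing DAG: treat as an external or untraced source.
        return [f"{indent}{target} (external or untraced)"]
    dag_id, sources = producers[target]
    lines = [f"{indent}{target} <- DAG: {dag_id}"]
    for source in sources:
        lines.extend(
            trace_upstream(source, producers, depth + 1, ancestors | {target})
        )
    return lines


if __name__ == "__main__":
    print("\n".join(trace_upstream("analytics.orders_daily", PRODUCERS)))
```

Tracking ancestors per path, rather than a global visited set, lets a shared source appear under every branch that uses it while still stopping on true cycles.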

Step 5: Check Source Health

For each upstream source:

Tables: Check freshness with the checking-freshness skill
DAGs: Check recent run status with af dags stats
External systems: Note connection info from DAG code
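For the external-systems item, connection info usually appears in DAG code as keyword arguments whose names end in conn_id. A minimal sketch for collecting them follows; find_connection_ids is a hypothetical helper, and it only sees connection ids written as string literals, not ones passed through variables or templates.

```python
import re

# Keyword arguments ending in conn_id with a quoted string literal value.
CONN_ID_PATTERN = re.compile(r"""\b(\w*conn_id)\s*=\s*["']([^"']+)["']""")


def find_connection_ids(dag_source: str) -> dict[str, set[str]]:
    """Map each *conn_id parameter name to the literal ids assigned to it."""
    found: dict[str, set[str]] = {}
    for param, conn_id in CONN_ID_PATTERN.findall(dag_source):
        found.setdefault(param, set()).add(conn_id)
    return found
```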

Lineage for Columns

When tracing a specific column:

Find the column in the target table schema
Search DAG source code for references to that column name
Trace through transformations:
- Direct mappings: source.col AS target_col
- Transformations: COALESCE(a.col, b.col) AS target_col
- Aggregations: SUM(detail.amount) AS total_amount
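The three cases above can be told apart with a small classifier over a SELECT list. This is a sketch under assumptions rather than a general solution: split_select_list and classify_column are hypothetical helpers, only explicit "expr AS alias" items are handled, and the aggregate list covers just the common SQL functions.

```python
import re
from typing import Optional

AGGREGATES = ("SUM", "COUNT", "AVG", "MIN", "MAX")


def split_select_list(select_list: str) -> list[str]:
    """Split on commas that are not nested inside parentheses."""
    items, current, depth = [], [], 0
    for ch in select_list:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if current:
        items.append("".join(current).strip())
    return items


def classify_column(select_list: str, target_col: str) -> Optional[tuple[str, str]]:
    """Return (kind, source expression) for `target_col`, or None if absent."""
    for item in split_select_list(select_list):
        m = re.fullmatch(r"(.+?)\s+AS\s+(\w+)", item, re.IGNORECASE | re.DOTALL)
        if not m or m.group(2).lower() != target_col.lower():
            continue
        expr = m.group(1).strip()
        if re.fullmatch(r"[\w.]+", expr):
            return "direct", expr
        if expr.upper().startswith(tuple(f"{name}(" for name in AGGREGATES)):
            return "aggregation", expr
        return "transformation", expr
    return None
```

For a transformation or aggregation, the columns named inside the returned expression are the next hop to trace.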

Output: Lineage Report

Summary

One-line answer: "This table is populated by DAG X from sources Y and Z"

Lineage Diagram

[Salesforce] --> [raw.opportunities] --> [stg.opportunities] --> [fct.sales]
                        |                        |
                   DAG: ingest_sfdc         DAG: transform_sales
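Both report pieces are easy to generate once the lineage is known. The sketch below is one possible rendering, with hypothetical helper names summarize and render_chain; the wording of the summary follows the one-line answer format shown under Summary.

```python
def summarize(table: str, dag_id: str, sources: list[str]) -> str:
    """Build the one-line answer for the Summary section."""
    if not sources:
        joined = "no identified sources"
    elif len(sources) == 1:
        joined = f"source {sources[0]}"
    else:
        joined = "sources " + ", ".join(sources[:-1]) + " and " + sources[-1]
    return f"{table} is populated by DAG {dag_id} from {joined}"


def render_chain(nodes: list[str]) -> str:
    """Render an ordered list of systems and tables as a one-line diagram."""
    return " --> ".join(f"[{node}]" for node in nodes)


if __name__ == "__main__":
    print(summarize("fct.sales", "transform_sales", ["stg.opportunities"]))
    print(render_chain(
        ["Salesforce", "raw.opportunities", "stg.opportunities", "fct.sales"]
    ))
```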