Table of Contents
Fetching ...

LINEAGEX: A Column Lineage Extraction System for SQL

Shi Heng Zhang, Zhengjie Miao, Jiannan Wang

TL;DR

LINEAGEX addresses column-level data lineage without executing queries by statically analyzing SQL via an AST-based traversal and a rules-based inference to map output columns to source input columns. It leverages a three-module pipeline (Preprocessing, Transformation with AST via SQLGlot, and Lineage Extraction) and a stack-based auto-inference to handle ambiguities like SELECT * and unprefixed references, with optional EXPLAIN integration for metadata. The contribution is a lightweight, API-friendly library with an interactive UI that yields accurate column-level lineage graphs, demonstrated on complex schemas such as MIMIC and contrasted with SQLLineage and AI-assisted analysis using GPT-4o. This supports governance tasks like data quality monitoring, refactoring, and impact analysis without requiring DBMS-level instrumentation.

Abstract

As enterprise data grows in size and complexity, column-level data lineage, which records the creation, transformation, and reference of each column in the warehouse, has been the key to effective data governance that assists tasks like data quality monitoring, storage refactoring, and workflow migration. Unfortunately, existing systems introduce overheads by integration with query execution or fail to achieve satisfying accuracy for column lineage. In this paper, we demonstrate LINEAGEX, a lightweight Python library that infers column level lineage from SQL queries and visualizes it through an interactive interface. LINEAGEX achieves high coverage and accuracy for column lineage extraction by intelligently traversing query parse trees and handling ambiguities. The demonstration walks through use cases of building lineage graphs and troubleshooting data quality issues. LINEAGEX is open sourced at https://github.com/sfu-db/lineagex and our video demonstration is at https://youtu.be/5LaBBDDitlw

LINEAGEX: A Column Lineage Extraction System for SQL

TL;DR

LINEAGEX addresses column-level data lineage without executing queries by statically analyzing SQL via an AST-based traversal and a rules-based inference to map output columns to source input columns. It leverages a three-module pipeline (Preprocessing, Transformation with AST via SQLGlot, and Lineage Extraction) and a stack-based auto-inference to handle ambiguities like SELECT * and unprefixed references, with optional EXPLAIN integration for metadata. The contribution is a lightweight, API-friendly library with an interactive UI that yields accurate column-level lineage graphs, demonstrated on complex schemas such as MIMIC and contrasted with SQLLineage and AI-assisted analysis using GPT-4o. This supports governance tasks like data quality monitoring, refactoring, and impact analysis without requiring DBMS-level instrumentation.

Abstract

As enterprise data grows in size and complexity, column-level data lineage, which records the creation, transformation, and reference of each column in the warehouse, has been the key to effective data governance that assists tasks like data quality monitoring, storage refactoring, and workflow migration. Unfortunately, existing systems introduce overheads by integration with query execution or fail to achieve satisfying accuracy for column lineage. In this paper, we demonstrate LINEAGEX, a lightweight Python library that infers column level lineage from SQL queries and visualizes it through an interactive interface. LINEAGEX achieves high coverage and accuracy for column lineage extraction by intelligently traversing query parse trees and handling ambiguities. The demonstration walks through use cases of building lineage graphs and troubleshooting data quality issues. LINEAGEX is open sourced at https://github.com/sfu-db/lineagex and our video demonstration is at https://youtu.be/5LaBBDDitlw

Paper Structure

This paper contains 4 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Lineage extraction from query logs without a database connection.
  • Figure 2: The lineage graph for Example \ref{['sec:example']}. Existing tools like SQLLineage sqllineage would miss columns in the dashed red rectangle and return wrong entries in the solid red rectangle, while the yellow is the correct lineage
  • Figure 3: An illustration of LineageX.
  • Figure 4: Sample AST and traverse order
  • Figure 5: The User Interface of LineageX.

Theorems & Definitions (1)

  • Example 1