Gen-T: Table Reclamation in Data Lakes

Grace Fan, Roee Shraga, Renée J. Miller

TL;DR

Gen-T tackles table reclamation in data lakes by identifying originating tables whose integration reproduces a given Source Table as closely as possible. It introduces the error-aware instance similarity (EIS) score and a two-stage pipeline: (1) Table Discovery to retrieve and refine candidate/originating tables, and (2) Table Reclamation via integration using a matrix-driven, three-valued representation and a focused set of operators including Outer Union and unary transformations. The method demonstrates strong effectiveness, achieving up to 5× higher reclamation of Source Table values than baselines across large, noisy data lakes, with scalability to datasets containing tens of thousands of tables and source tables with up to 1K rows. Gen-T also shows promise in generalizing beyond synthetic benchmarks to real-world domains (T2D Gold, WDC), and offers avenues for future work in relaxing the key assumption, embedding partial reclamations in updated lakes, and validating AI-generated tabular outputs.

Abstract

We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the Source Table as closely as possible. Unlike query discovery problems such as Query-by-Example or Query-by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, to measure how close a reclaimed table is to a Source Table, a measure grounded in instance similarity used in data exchange. Our search covers not only SELECT-PROJECT-JOIN queries, but also integration queries with unions, outer joins, and the unary operators subsumption and complementation, which have been shown to be important in data integration and fusion. Using reclamation, a data scientist can understand if any tables in a repository can be used to exactly reclaim a tuple in the Source. If not, one can understand if this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, then integrates these tables to reclaim the Source as closely as possible. We show that our solution, while approximate, is accurate, efficient, and scalable in the size of the table repository, with experiments on real data lakes containing up to 15K tables, where the average number of tuples varies from small (web tables) to extremely large (open data tables with up to 1M tuples).
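To make the distinction between "differences in values" and "incompleteness" concrete, here is a minimal Python sketch. It is not the paper's error-aware instance similarity formula. The function name `encode_alignment`, the 1 / 0 / -1 encoding, and the sample rows are all assumptions made for illustration. The sketch compares one Source tuple with a reclaimed tuple that has already been aligned on a key, and labels each cell as reclaimed, missing, or conflicting.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def encode_alignment(source_row: Row, reclaimed_row: Row) -> dict[str, int]:
    """Label each Source cell with a three-valued code (an assumed encoding):
    1 = value reclaimed, 0 = value missing (null), -1 = value conflicts."""
    encoding = {}
    for column, source_value in source_row.items():
        reclaimed_value = reclaimed_row.get(column)
        if reclaimed_value == source_value:
            encoding[column] = 1    # same value as the Source
        elif reclaimed_value is None:
            encoding[column] = 0    # incompleteness: nothing to compare
        else:
            encoding[column] = -1   # inconsistency: a different value
    return encoding


# Hypothetical tuples, aligned on the key column "ID".
source = {"ID": "1", "Name": "Smith", "Age": "27", "Gender": "F"}
reclaimed = {"ID": "1", "Name": "Smith", "Age": None, "Gender": "M"}

codes = encode_alignment(source, reclaimed)
missing = [c for c, v in codes.items() if v == 0]
conflicting = [c for c, v in codes.items() if v == -1]
```

Here `missing` lists the columns lost to incompleteness and `conflicting` lists the columns where the repository disagrees with the Source, which is the diagnosis the abstract describes.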

Paper Structure

This paper contains 33 sections, 6 theorems, 12 equations, 9 figures, 4 tables, 5 algorithms.

Key Result

Theorem 1

Given two tables that contain no duplicate tuples and no tuples that can be subsumed or complemented, for every SPJU query there exists an equivalent query consisting of only Outer Union and the four unary operators (selection, projection, complementation, and subsumption). The proof is included in
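
As an informal illustration of the operators named in Theorem 1, the following Python sketch implements Outer Union, complementation, and subsumption over small in-memory tables and uses them to emulate an outer join. It is a simplified sketch under stated assumptions, not Gen-T's implementation. Nulls are modeled as `None`, complementation merges pairs greedily, and the table contents and all function names are hypothetical.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def outer_union(left: list[Row], right: list[Row]) -> list[Row]:
    """Union two tables, padding attributes a row lacks with None."""
    columns: list[str] = []
    for row in left + right:
        for column in row:
            if column not in columns:
                columns.append(column)
    return [{c: row.get(c) for c in columns} for row in left + right]


def non_null_count(row: Row) -> int:
    return sum(value is not None for value in row.values())


def subsumes(t1: Row, t2: Row) -> bool:
    """t1 subsumes t2: t1 agrees on every non-null value of t2
    and has strictly more non-null values."""
    agrees = all(t1.get(c) == v for c, v in t2.items() if v is not None)
    return agrees and non_null_count(t1) > non_null_count(t2)


def subsumption(table: list[Row]) -> list[Row]:
    """Drop every tuple that some other tuple subsumes."""
    return [t for t in table if not any(subsumes(o, t) for o in table)]


def complements(t1: Row, t2: Row) -> bool:
    """t1 and t2 agree wherever both are non-null (at least once),
    and each fills at least one null of the other."""
    shared = [c for c in t1 if t1[c] is not None and t2[c] is not None]
    agree = bool(shared) and all(t1[c] == t2[c] for c in shared)
    t1_fills = any(t1[c] is not None and t2[c] is None for c in t1)
    t2_fills = any(t2[c] is not None and t1[c] is None for c in t1)
    return agree and t1_fills and t2_fills


def complementation(table: list[Row]) -> list[Row]:
    """Greedily merge complementing pairs until none remain (a simplification)."""
    rows = [dict(r) for r in table]
    merged = True
    while merged:
        merged = False
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                if complements(rows[i], rows[j]):
                    combined = {c: rows[i][c] if rows[i][c] is not None
                                else rows[j][c] for c in rows[i]}
                    rows = [r for k, r in enumerate(rows) if k not in (i, j)]
                    rows.append(combined)
                    merged = True
                    break
            if merged:
                break
    return rows


# Hypothetical input tables sharing the columns ID and Name.
table_a = [{"ID": "1", "Name": "Smith", "Age": "27"},
           {"ID": "2", "Name": "Brown", "Age": None}]
table_b = [{"ID": "1", "Name": "Smith", "Gender": "F"},
           {"ID": "2", "Name": "Brown", "Gender": "M"}]

reclaimed = subsumption(complementation(outer_union(table_a, table_b)))
reclaimed.sort(key=lambda row: row["ID"])
```

In this example, complementation merges the two partial tuples for ID 1, and subsumption removes the less informative tuple for ID 2. The result has one tuple per ID, which is what an outer join of the two tables on ID and Name would produce.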

Figures (9)

  • Figure 1: A news article reports the top blue table. A user has access to Microsoft's diversity report, which seems to contradict the article (bottom green table).
  • Figure 2: Gen-T Architecture. Given a Source Table, Gen-T finds a set of originating tables (Table Discovery), produces a reclaimed Source Table from them (Table Reclamation), and returns the originating tables and the reclaimed Source Table.
  • Figure 3: Source Table (in green) contains applicants' information, such as ID, Name, Age, Gender, and Education Level. Tables A, B, C, D (in blue) are possible tables from which the Source Table's instances originated. Missing values and inconsistent values w.r.t. Source Table are depicted in yellow ('---') and red, respectively. Tables on the right (in yellow) are possible integrations of tables resulting from integration methods using Full Disjunction (FD) and outer join (⟗).
  • Figure 4: Aligned tuples between a Source Table (left green table) and two possible reclaimed tables (right yellow tables) from Figure 3, aligned based on key column 'ID'.
  • Figure 5: Matrix initialization and integration of tables A, B, C, given the Source Table from Figure 3, simulate their table integration. The result of matrix integration is equivalent to the matrix representation of the table integration result.
  • ...and 4 more figures

Theorems & Definitions (15)

  • Example 1
  • Example 2
  • Example 3
  • Definition 1
  • Definition 2
  • Example 4
  • Definition 3: Table Reclamation
  • Theorem 1: Representative Operators
  • Example 5
  • Example 6
  • ...and 5 more