Gen-T: Table Reclamation in Data Lakes
Grace Fan, Roee Shraga, Renée J. Miller
TL;DR
Gen-T tackles table reclamation in data lakes by identifying originating tables whose integration reproduces a given Source Table as closely as possible. It introduces the error-aware instance similarity (EIS) score and a two-stage pipeline: (1) Table Discovery to retrieve and refine candidate/originating tables, and (2) Table Reclamation via integration using a matrix-driven, three-valued representation and a focused set of operators including Outer Union and unary transformations. The method reclaims up to 5× more Source Table values than baselines across large, noisy data lakes, and scales to data lakes containing up to 15K tables and Source Tables with up to 1K rows. Gen-T also shows promise in generalizing beyond synthetic benchmarks to real-world domains (T2D Gold, WDC), and offers avenues for future work in relaxing the key assumption, embedding partial reclamations in updated lakes, and validating AI-generated tabular outputs.
Abstract
We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, to measure how close a reclaimed table is to a Source Table, a measure grounded in instance similarity used in data exchange. Our search covers not only SELECT-PROJECT-JOIN queries, but integration queries with unions, outerjoins, and the unary operators subsumption and complementation that have been shown to be important in data integration and fusion. Using reclamation, a data scientist can understand if any tables in a repository can be used to exactly reclaim a tuple in the Source. If not, one can understand if this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, then integrates these tables to reclaim the Source as closely as possible. We show that our solution, while approximate, is accurate, efficient and scalable in the size of the table repository with experiments on real data lakes containing up to 15K tables, where the average number of tuples varies from small (web tables) to extremely large (open data tables) up to 1M tuples.
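To make the integration operators named in the abstract more concrete, the following Python sketch illustrates outer union, subsumption, and complementation on small in-memory tables. It is not Gen-T's implementation: the function names, the list-of-dicts table representation, the use of None for nulls, and the greedy pairwise merge in apply_complementation are all assumptions chosen for illustration, and the operator definitions follow the common data-fusion formulations rather than anything stated on this page.

```python
from typing import Dict, List, Optional

# Hypothetical representation: a table is a list of rows, None is a null.
Row = Dict[str, Optional[str]]


def outer_union(t1: List[Row], t2: List[Row]) -> List[Row]:
    """Union two tables over all their columns, padding missing cells with nulls."""
    columns = sorted({c for row in t1 + t2 for c in row})
    return [{c: row.get(c) for c in columns} for row in t1 + t2]


def subsumes(a: Row, b: Row) -> bool:
    """a subsumes b if a matches every non-null value of b and has more non-nulls."""
    agrees = all(a[c] == v for c, v in b.items() if v is not None)
    count_a = sum(v is not None for v in a.values())
    count_b = sum(v is not None for v in b.values())
    return agrees and count_a > count_b


def apply_subsumption(table: List[Row]) -> List[Row]:
    """Drop every row that is subsumed by some other row in the table."""
    return [r for r in table if not any(subsumes(o, r) for o in table)]


def complements(a: Row, b: Row) -> bool:
    """a and b agree wherever both are non-null, share a value, and fill each other's nulls."""
    shared = [c for c in a if a[c] is not None and b[c] is not None]
    if not shared or any(a[c] != b[c] for c in shared):
        return False
    a_fills = any(a[c] is not None and b[c] is None for c in a)
    b_fills = any(b[c] is not None and a[c] is None for c in a)
    return a_fills and b_fills


def apply_complementation(table: List[Row]) -> List[Row]:
    """Greedily merge complementing row pairs until none remain (a simplification)."""
    rows = list(table)
    merged = True
    while merged:
        merged = False
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                if complements(rows[i], rows[j]):
                    a, b = rows[i], rows[j]
                    combined = {c: a[c] if a[c] is not None else b[c] for c in a}
                    rows = [r for k, r in enumerate(rows) if k not in (i, j)]
                    rows.append(combined)
                    merged = True
                    break
            if merged:
                break
    return rows


# Two made-up tables that each hold part of the same entities.
names = [{"id": "1", "name": "Ana"}, {"id": "2", "name": "Bo"}]
cities = [{"id": "1", "city": "Oslo"}, {"id": "2", "name": "Bo", "city": "Rome"}]

integrated = apply_complementation(apply_subsumption(outer_union(names, cities)))
result = sorted(integrated, key=lambda r: r["id"])
```

In this toy example, the partial row for id 2 is removed by subsumption because a more complete row carries the same values, and the two partial rows for id 1 are merged by complementation, leaving one full row per entity.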
