EasyTUS: A Comprehensive Framework for Fast and Accurate Table Union Search across Data Lakes

Tim Otto

TL;DR

EasyTUS tackles Table Union Search across multi-source data lakes by leveraging zero-shot, LLM-based table serialization and embedding, paired with ANN vector search in a cross-data-lake framework. It eliminates fine-tuning, uses a simple yet scalable offline-online pipeline, and introduces TUSBench to ensure reproducible benchmarking across diverse data lakes. Empirical results show substantial improvements in Mean Average Precision (MAP) and dramatic speedups in data preparation and query processing, with robust performance even when metadata is absent. The framework’s modular design and standard benchmarking environment position it for broad applicability and future enhancements with newer embedding models.

Abstract

Data lakes enable easy maintenance of heterogeneous data in its native form. While this flexibility can accelerate data ingestion, it shifts the complexity of data preparation and query processing to data discovery tasks. One such task is Table Union Search (TUS), which identifies tables that can be unioned with a given input table. In this work, we present EasyTUS, a comprehensive framework that leverages Large Language Models (LLMs) to perform efficient and scalable Table Union Search across data lakes. EasyTUS implements the search pipeline as three modular steps: Table Serialization for consistent formatting and sampling, Table Representation that utilizes LLMs to generate embeddings, and Vector Search that leverages approximate nearest neighbor indexing for semantic matching. To enable reproducible and systematic evaluation, we also introduce TUSBench, a novel standardized benchmarking environment within the EasyTUS framework. TUSBench supports unified comparisons across approaches and data lakes, promoting transparency and progress in the field. Our experiments using TUSBench show that EasyTUS consistently outperforms most state-of-the-art approaches, achieving average improvements of up to 34.3% in Mean Average Precision (MAP), up to 79.2x speedup in data preparation, and up to 7.7x faster query processing. Furthermore, EasyTUS maintains strong performance even in metadata-absent settings, highlighting its robustness and adaptability across data lakes.
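
The three steps named in the abstract can be illustrated with a minimal sketch. This is not the EasyTUS implementation: all function names are hypothetical, `toy_embed` is a hashed bag-of-tokens placeholder standing in for an LLM embedding model, and the search is an exact cosine scan standing in for an approximate nearest neighbor index. It only shows how serialization, representation, and vector search fit together in an offline and an online phase.

```python
import hashlib
import math
import random


def serialize_table(columns, rows, name=None, sample_size=3, seed=0):
    """Step 1 (Table Serialization): render header and sampled rows as text.

    `name` is optional so the same function covers a metadata-absent setting.
    """
    rng = random.Random(seed)
    sampled = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    lines = [] if name is None else [f"table: {name}"]
    lines.append("columns: " + " | ".join(columns))
    lines += [" | ".join(str(v) for v in row) for row in sampled]
    return "\n".join(lines)


def toy_embed(text, dim=1024):
    """Step 2 placeholder (Table Representation): hashed bag-of-tokens vector.

    ASSUMPTION: a real system would call an LLM embedding model here.
    md5 is used because it is stable across runs, unlike Python's hash().
    """
    vec = [0.0] * dim
    for token in text.lower().replace("|", " ").split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalised, so dot product = cosine


def build_index(tables):
    """Offline phase: serialize and embed every data lake table once."""
    return {
        name: toy_embed(serialize_table(cols, rows))
        for name, (cols, rows) in tables.items()
    }


def search(index, query_vec, k=2):
    """Online phase, Step 3 placeholder (Vector Search): exact cosine top-k.

    ASSUMPTION: brute force replaces the approximate nearest neighbor index.
    """
    scored = [
        (sum(a * b for a, b in zip(query_vec, vec)), name)
        for name, vec in index.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Invented example data, for illustration only.
lake = {
    "cities_eu": (["city", "country", "population"],
                  [["Berlin", "Germany", 3600000], ["Paris", "France", 2100000]]),
    "cities_us": (["city", "country", "population"],
                  [["Austin", "USA", 960000], ["Boston", "USA", 650000]]),
    "products": (["sku", "price", "stock"],
                 [["A1", 9.99, 12], ["B2", 4.5, 0]]),
}
index = build_index(lake)
query_vec = toy_embed(
    serialize_table(["city", "country", "population"], [["Rome", "Italy", 2800000]])
)
print(search(index, query_vec, k=2))
```

Because the index is built once and only the query table is serialized and embedded at search time, swapping the placeholder embedding for a stronger model or the scan for an ANN index leaves the rest of the pipeline unchanged.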

Paper Structure

This paper contains 25 sections, 6 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of Table Union Search on a single data lake. The query table (left) and potential unionable tables from the data lake (right) are highlighted in green. Tables that are not unionable are marked in red.
  • Figure 2: Overview of the EasyTUS architecture for Table Union Search across multiple data lakes. The core steps, Table Serialization and Table Representation, are executed in both the offline and online phases. In the offline phase, embedding vectors for all tables from individual data lakes are generated and persisted, while in the online phase, an approximate nearest neighbor search is performed using the query table vector and the persisted embeddings.
  • Figure 3: Overview of the TUSBench environment setup, illustrated with two example data lakes, namely TUS-SANTOS and ECB Union, and two example approaches, Starmie and EasyTUS. Blue boxes represent provided components, orange boxes indicate generated intermediates, green boxes denote customizable metrics, and red boxes highlight shareable results.
  • Figure 4: Mean Average Precision (MAP@$k$) on the tus_santos data lake benchmark. Values of $k$ are plotted on the x-axis. Similar trends are observed for other data lakes as well.
  • Figure 5: Comparison of EasyTUS (ET-J and ET-O) against the state-of-the-art in terms of Mean Average Precision (MAP@$k$) and Average Recall (AR@$k$) across selected benchmark data lakes with metadata. Values of $k$ are plotted on the x-axis.
  • ...and 2 more figures
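
The MAP@$k$ and AR@$k$ metrics named in the captions above can be computed as in the following sketch. The paper's exact definitions are not reproduced in this excerpt, so the sketch assumes one convention seen in table union search evaluations: average precision at $k$ is the mean of P@$i$ for $i = 1, \dots, k$, and recall at $k$ is the fraction of ground-truth unionable tables found in the top $k$. All function names and the example data are hypothetical.

```python
def precision_at(ranked, relevant, i):
    """P@i: fraction of the top-i slots filled with relevant tables."""
    return sum(1 for t in ranked[:i] if t in relevant) / i


def average_precision_at_k(ranked, relevant, k):
    """ASSUMED convention: mean of P@i over i = 1..k for one query."""
    return sum(precision_at(ranked, relevant, i) for i in range(1, k + 1)) / k


def recall_at_k(ranked, relevant, k):
    """R@k: fraction of all relevant tables that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def mean_over_queries(metric, results, ground_truth, k):
    """Average a per-query metric over all query tables (MAP@k, AR@k)."""
    return sum(metric(results[q], ground_truth[q], k) for q in results) / len(results)


# Invented example: ranked search results and ground truth per query table.
results = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
ground_truth = {"q1": {"a", "b", "c"}, "q2": {"c"}}

map_at_2 = mean_over_queries(average_precision_at_k, results, ground_truth, 2)
ar_at_2 = mean_over_queries(recall_at_k, results, ground_truth, 2)
print(map_at_2, ar_at_2)
```

Passing the metric as a function keeps the per-query logic separate from the averaging, which mirrors the idea of customizable metrics in a shared benchmarking setup.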

Theorems & Definitions (1)

  • Definition 1