EasyTUS: A Comprehensive Framework for Fast and Accurate Table Union Search across Data Lakes
Tim Otto
TL;DR
EasyTUS tackles Table Union Search across multi-source data lakes by combining zero-shot, LLM-based table serialization and embedding with approximate nearest neighbor (ANN) vector search in a cross-data-lake framework. It requires no fine-tuning, uses a simple yet scalable offline-online pipeline, and introduces TUSBench for reproducible benchmarking across diverse data lakes. Empirical results show average improvements of up to 34.3% in Mean Average Precision (MAP), up to 79.2x faster data preparation, and up to 7.7x faster query processing, with robust performance even when metadata is absent. The framework’s modular design and standardized benchmarking environment position it for broad applicability and future enhancements with newer embedding models.
Abstract
Data lakes enable easy maintenance of heterogeneous data in its native form. While this flexibility can accelerate data ingestion, it shifts the complexity of data preparation and query processing to data discovery tasks. One such task is Table Union Search (TUS), which identifies tables that can be unioned with a given input table. In this work, we present EasyTUS, a comprehensive framework that leverages Large Language Models (LLMs) to perform efficient and scalable Table Union Search across data lakes. EasyTUS implements the search pipeline as three modular steps: Table Serialization for consistent formatting and sampling, Table Representation that uses LLMs to generate embeddings, and Vector Search that uses approximate nearest neighbor indexing for semantic matching. To enable reproducible and systematic evaluation, we also introduce TUSBench, a novel standardized benchmarking environment within the EasyTUS framework. TUSBench supports unified comparisons across approaches and data lakes, promoting transparency and progress in the field. Our experiments using TUSBench show that EasyTUS consistently outperforms most state-of-the-art approaches, achieving average improvements of up to 34.3% in Mean Average Precision (MAP), up to 79.2x speedup in data preparation, and up to 7.7x faster query processing. Furthermore, EasyTUS maintains strong performance even in metadata-absent settings, highlighting its robustness and adaptability across data lakes.
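The three steps named in the abstract can be illustrated with a minimal sketch. It is not the EasyTUS implementation. All function names (`serialize_table`, `embed_text`, `build_index`, `search`) and the serialization layout are hypothetical. The LLM embedding is replaced by a normalized bag-of-tokens vector, and the approximate nearest neighbor index is replaced by an exact cosine scan, so the example runs without any model or external library.

```python
import math
import random
import re
from collections import Counter


def serialize_table(name, columns, rows, max_rows=3, seed=0):
    """Step 1 (assumed layout): table name, header, and a small row sample as text."""
    rng = random.Random(seed)
    sample = rows if len(rows) <= max_rows else rng.sample(rows, max_rows)
    lines = [f"table: {name}", "columns: " + " | ".join(columns)]
    lines += [" | ".join(str(v) for v in row) for row in sample]
    return "\n".join(lines)


def embed_text(text):
    """Step 2 stand-in: L2-normalized token counts instead of an LLM embedding."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def build_index(lake):
    """Offline phase: serialize and embed every table once."""
    return {
        name: embed_text(serialize_table(name, cols, rows))
        for name, (cols, rows) in lake.items()
    }


def search(index, query_vec, k=2):
    """Step 3 stand-in: exact cosine top-k; an ANN index would replace this scan."""
    scored = [
        (sum(w * vec.get(tok, 0.0) for tok, w in query_vec.items()), name)
        for name, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy data lake (made-up tables, for illustration only).
lake = {
    "cities_eu": (["city", "country", "population"],
                  [["Berlin", "Germany", 3645000], ["Paris", "France", 2161000]]),
    "cities_us": (["city", "country", "population"],
                  [["Austin", "USA", 961000], ["Boston", "USA", 675000]]),
    "products": (["sku", "price", "stock"],
                 [["A1", 9.99, 12], ["B2", 4.50, 7]]),
}

index = build_index(lake)

# Online phase: embed the query table the same way, then look up neighbors.
query_vec = embed_text(serialize_table(
    "cities_asia", ["city", "country", "population"], [["Tokyo", "Japan", 13960000]]))
top = search(index, query_vec, k=2)
print(top)
```

In this toy setting the two city tables rank above the product table because they share the query's column names. A real embedding model would presumably also capture similarity between cell values and differently named columns, which token overlap cannot.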
