Efficient and Effective Table-Centric Table Union Search in Data Lakes

Yongkang Sun, Zhihao Ding, Huiqiang Wang, Reynold Cheng, Jieming Shi

Abstract

In data lakes, information on the same subject is often fragmented across multiple tables. Table union search aims to find the top-k tables that can be unioned with a query table to extend it with more rows, without relying on metadata or ground-truth labels. Existing methods are mainly column-centric: they focus on modeling column unionability scores using column embeddings, which are then used throughout the search process for column matching, filtering, and aggregation. However, this overlooks holistic table-level semantics, which may result in suboptimal rankings and inefficiencies. We introduce TACTUS, a novel table-centric method for table union search. Unlike prior work that searches from columns to tables, we search in a table-first way and examine columns only in the final step. During offline processing, we directly generate table embeddings for holistic, table-level unionability scoring by designing table-level representation techniques, including positive table pair construction to simulate unionable tables, two-pronged negative table sampling to avoid latent positives and mine hard negatives to enhance representation quality, and attentive table encoding for effective embeddings. During online search, we first develop a table-centric adaptive candidate retrieval method that efficiently selects a compact, high-quality candidate pool by leveraging the distribution of table-level unionability scores induced by table embeddings. We then inspect columns only within this compact candidate set and design a dual-evidence reranking technique that integrates table-level and column-level scores to refine the final top-k results. Extensive experiments on real-world datasets show that TACTUS significantly improves result quality while being much faster than existing methods in both offline and online processing, often by an order of magnitude.
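The table-first search flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring functions, so everything concrete here is an assumption, including cosine similarity as the table-level unionability score, a mean-plus-`alpha`-standard-deviations cutoff as the "adaptive" retrieval rule, a best-match average as the column-level score, and a linear weight `lam` for combining the two kinds of evidence. All function names and parameters are hypothetical, and the embeddings are random placeholders rather than outputs of a trained encoder.

```python
import numpy as np


def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def retrieve_candidates(query_vec, table_vecs, k, alpha=1.0):
    """Stage 1 (assumed rule): keep tables whose table-level score stands out
    from the score distribution; fall back to the plain top-k if too few do."""
    scores = cosine_matrix(query_vec[None, :], table_vecs)[0]
    threshold = scores.mean() + alpha * scores.std()
    keep = np.flatnonzero(scores >= threshold)
    if keep.size < k:
        keep = np.argsort(-scores)[:k]
    return keep, scores


def column_score(query_cols, cand_cols):
    """Assumed column-level evidence: for each query column take its best
    matching candidate column, then average over the query columns."""
    sim = cosine_matrix(query_cols, cand_cols)
    return float(sim.max(axis=1).mean())


def search(query_vec, query_cols, table_vecs, table_cols, k, lam=0.5):
    """Stage 2 (assumed rule): rerank only the candidate pool with a weighted
    sum of table-level and column-level scores, then return the top-k ids."""
    cand, table_scores = retrieve_candidates(query_vec, table_vecs, k)
    combined = {
        int(i): lam * float(table_scores[i])
        + (1.0 - lam) * column_score(query_cols, table_cols[i])
        for i in cand
    }
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:k]


# Toy data lake: 20 tables, one 8-d table embedding each, 4 column embeddings each.
rng = np.random.default_rng(0)
table_vecs = rng.normal(size=(20, 8))
table_cols = [rng.normal(size=(4, 8)) for _ in range(20)]

# Use table 3 as the query, so it should be its own best match.
top = search(table_vecs[3], table_cols[3], table_vecs, table_cols, k=3)
print(top)
```

Column embeddings are compared only for tables that survive the first stage, which is the property the abstract credits for the online speedup; how the candidate pool is actually sized and how the two scores are actually combined would need to be taken from the paper itself.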

Paper Structure (18 sections, 9 equations, 11 figures, 8 tables, 2 algorithms)


Figures (11)

  • Figure 1: Tables A and B are unionable as they share the same subject (bus ridership) and have semantically aligned columns. Table C, on a different subject (railway maintenance), is non-unionable with table A; unioning them leads to inconsistent table content. The links between columns indicate the highly matched column pairs, with column unionability scores from Starmie; these are aggregated into the table unionability scores shown top-right. For query table A, Starmie ranks non-unionable table C (2.056) above unionable table B (1.446).
  • Figure 2: The Overview of TACTUS
  • Figure 3: Illustration of Two-pronged Negative Sampling
  • Figure 4: Illustration of the Attentive Table Encoder
  • Figure 5: Precision $P@k$ and Relative Recall $R@k/R_{ub}@k$ with Varied $k$
  • ...and 6 more figures

Theorems & Definitions (3)

  • Example 1
  • Example 2
  • Definition 1: Top-$k$ Table Union Search