Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks

Allaa Boutaleb, Bernd Amann, Hubert Naacke, Rafael Angarita

TL;DR

This paper critiques Table Union Search benchmarks by showing that excessive schema overlap, semantic simplicity, and ground-truth noise allow simple baselines and general embeddings to rival or outperform sophisticated TUS models. Through systematic benchmark analysis and diagnostic baselines, it reveals that current scores often reflect artifact-driven signals rather than genuine semantic understanding. It introduces metrics for ground-truth reliability, demonstrates inconsistencies via LLM adjudication, and articulates design principles and practical pathways for more realistic, discriminative benchmarks. The work argues that progress in semantic table union search should be validated against benchmarks that capture real-world variability, domain complexity, and nuanced notions of unionability to ensure meaningful improvements have practical impact.

Abstract

Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes, which involves identifying tables that can be unioned with a given query table to enrich its content. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. However, our analysis of prominent TUS benchmarks reveals several limitations that allow simple baselines to perform surprisingly well, often outperforming more sophisticated approaches. This suggests that current benchmark scores are heavily influenced by dataset-specific characteristics and fail to effectively isolate the gains from semantic understanding. To address this, we propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.

Paper Structure

This paper contains 56 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Distribution of Exact Column Name Overlap (Top) and String Value Overlap (Bottom) Coefficients for Ground Truth Unionable Pairs Across Benchmarks. Colored circles represent mean values; numbers on the right indicate total pairwise relationships considered.
  • Figure 2: Distribution of exact column name and tuple overlap across different benchmarks, broken down by data type (String, Numeric, Datetime, Other). Each subplot represents a benchmark, showing the percentage of ground truth pairs falling into different overlap ranges.
  • Figure 3: Examples of Ugen where pairs labeled unionable in the original ground truth exhibit significant semantic/structural divergence suggesting non-unionability.
  • Figure 4: Examples of Ugen Pairs explicitly labeled as non-unionable in the original ground truth exhibiting strong compatibility suggesting unionability.
  • Figure 5: Examples of LB-Webtable Ground Truth Incompleteness.
  • ...and 1 more figures
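
The overlap coefficients named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the standard Szymkiewicz-Simpson overlap coefficient, |A ∩ B| / min(|A|, |B|); the paper's exact definition is not given in this excerpt, and the function name and the example column names below are hypothetical.

```python
def overlap_coefficient(a, b):
    """Szymkiewicz-Simpson overlap coefficient: |A ∩ B| / min(|A|, |B|).

    Assumption: this is the standard set-overlap definition, not
    necessarily the exact formula used in the paper.
    """
    set_a, set_b = set(a), set(b)
    # Define the overlap with an empty set as 0 to avoid dividing by zero.
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Hypothetical column names for a query table and a candidate table.
query_columns = ["city", "country", "population"]
candidate_columns = ["city", "country", "area_km2", "mayor"]

# Two shared names out of a minimum set size of three.
score = overlap_coefficient(query_columns, candidate_columns)
print(f"exact column name overlap: {score:.3f}")
```

The same function applies to string value overlap by passing the sets of distinct cell values from two columns instead of column names. A score near 1 for most ground-truth pairs would indicate that exact matching alone can recover them, which is the kind of artifact the TL;DR describes.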