Observatory: Characterizing Embeddings of Relational Tables

Tianji Cong, Madelon Hulsebos, Zhenjie Sun, Paul Groth, H. V. Jagadish

TL;DR

Observatory presents a principled framework for intrinsic analysis of relational-table embeddings by defining eight relational and data-distribution properties with quantitative measures. It evaluates nine models (covering vanilla language models and table-embedding models) across curated datasets, revealing that many embeddings fail to reflect functional dependencies and can be sensitive to table structure and perturbations. The results offer concrete guidance for model selection and task design, such as preferring certain models for context-sensitive joins or recognizing sampling trade-offs for large tables. By open-sourcing its implementation and datasets, Observatory enables researchers to extend intrinsic analyses to new architectures and tasks, supporting more robust application of tabular embeddings in real-world workflows.

Abstract

Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model for a given task reliant on trial and error. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage. To address this need, we propose Observatory, a formal framework to systematically analyze embedding representations of relational tables. Motivated both by invariants of the relational data model and by statistical considerations regarding data distributions, we define eight primitive properties, and corresponding measures to quantitatively characterize table embeddings for these properties. Based on these properties, we define an extensible framework to evaluate language and table embedding models. We collect and synthesize a suite of datasets and use Observatory to analyze nine such models. Our analysis provides insights into the strengths and weaknesses of learned representations over tables. We find, for example, that some models are sensitive to table structure such as column order, that functional dependencies are rarely reflected in embeddings, and that specialized table embedding models have relatively lower sample fidelity. Such insights help researchers and practitioners better anticipate model behaviors and select appropriate models for their downstream tasks, while guiding researchers in the development of new models.
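
To make the idea of a property measure concrete, the following is a minimal sketch of how row order insignificance might be probed: embed a column, embed row-shuffled copies of it, and compare the results by cosine similarity. The two embedding functions are hypothetical toy stand-ins, not any of the nine models analyzed, and the function names are assumptions introduced for illustration. The paper's figures also report MCV for this property, which is not computed here.

```python
import zlib
import numpy as np

def cell_vector(value, dim=16):
    # Deterministic pseudo-random vector per cell value (toy feature, not a real model).
    seed = zlib.crc32(str(value).encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(dim)

def order_invariant_embedding(column, dim=16):
    # Hypothetical embedding: plain mean of cell vectors, so row order cannot matter.
    return np.mean([cell_vector(v, dim) for v in column], axis=0)

def order_sensitive_embedding(column, dim=16):
    # Hypothetical embedding: earlier rows get larger weights, so row order matters.
    weights = 1.0 / (1.0 + np.arange(len(column)))
    vectors = np.stack([cell_vector(v, dim) for v in column])
    return (weights[:, None] * vectors).sum(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def row_shuffle_similarities(column, embed, n_perm=20, seed=0):
    # Cosine similarity between the original embedding and each shuffled variant.
    rng = np.random.default_rng(seed)
    base = embed(column)
    sims = []
    for _ in range(n_perm):
        perm = rng.permutation(len(column))
        shuffled = [column[i] for i in perm]
        sims.append(cosine(base, embed(shuffled)))
    return np.array(sims)

column = ["France", "Japan", "Brazil", "Kenya", "Canada", "India"]
sims_invariant = row_shuffle_similarities(column, order_invariant_embedding)
sims_sensitive = row_shuffle_similarities(column, order_sensitive_embedding)
print("order-invariant min similarity:", sims_invariant.min())
print("order-sensitive min similarity:", sims_sensitive.min())
```

An embedding that respects row order insignificance should keep these similarities at or very near 1 under every permutation, while an order-sensitive one shows a visible drop.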

Paper Structure (38 sections, 7 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Overview of Observatory and how it solicits understanding of opaque table embedding models by measuring properties motivated by the relational data model and data distributions. We illustrate the framework for two out of eight properties: 1) row order insignificance, and 2) sample fidelity.
  • Figure 2: Illustration of row permutations.
  • Figure 3: Table with a functional dependency country $\rightarrow$ continent. The colors illustrate different FD groups determined by the unique values in the country column.
  • Figure 4: A table (without header) comprising textual and non-textual data columns.
  • Figure 5: Cosine similarity and MCV distributions of column (top), row (middle), and table (bottom) embeddings from row shuffling. Across three levels of embeddings, table embedding models exhibit comparably lower cosine similarity while both language and table embedding models may exhibit high MCV.
  • ...and 8 more figures

Theorems & Definitions (6)

  • Definition 1: Table Embedding Characterization
  • Example
  • Example
  • Example
  • Example
  • Example