Graph Neural Network Approach to Semantic Type Detection in Tables

Ehsan Hoseinzade, Ke Wang

TL;DR

The paper tackles semantic column type detection in tables under the input-length limits of language models by introducing GAIT, a framework that stacks a graph neural network (GNN) on top of a strong single-column predictor (RECA). Each table is represented as a graph whose nodes are columns and whose edges capture dependencies between them, so GAIT combines intra-table dependencies with the inter-table context handled by the language model and refines predictions beyond what the language model achieves alone. Empirical results on Webtables and Semtab show that GAIT, particularly the GAT variant, outperforms existing baselines, with notable gains on low-frequency classes, which demonstrates the value of modeling column dependencies. The approach supports data cleaning, schema matching, and data discovery by enabling multi-column prediction without exceeding the language model's token budget.

Abstract

This study addresses the challenge of detecting semantic column types in relational tables, a key task in many real-world applications. While language models like BERT have improved prediction accuracy, their token input constraints limit the simultaneous processing of intra-table and inter-table information. We propose a novel approach using Graph Neural Networks (GNNs) to model intra-table dependencies, allowing language models to focus on inter-table information. Our proposed method not only outperforms existing state-of-the-art algorithms but also offers novel insights into the utility and functionality of various GNN types for semantic type detection. The code is available at https://github.com/hoseinzadeehsan/GAIT

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The two tables on the right both have a column containing the values "Paris", "Ottawa", and "London". Without information from the other columns, it is difficult for a single-column prediction model to detect the actual semantic types of these columns. Multi-column prediction labels them correctly by jointly predicting all columns in a table.
  • Figure 2: The framework of GAIT. GAIT adds GNN learning on top of the single-column prediction module, which is RECA in this work. The output of RECA is a class distribution for each column in a table, which provides the initial hidden state of the node representing that column in the GNN. For a table with $n$ columns, RECA is run $n$ times. The GNN then learns hidden-state representations for all nodes that minimize a loss function, using message passing to model the dependencies between columns.
  • Figure 3: Macro F-score of $\textnormal{GAIT}_{\textnormal{GAT}}$ and RECA on the Semtab dataset for high-, medium-, and low-frequency classes.
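
The idea in Figures 1 and 2 can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the label set `CLASSES`, the hand-set weights in `COMPAT`, the fully connected column graph, and the mean-aggregation update are all assumptions made for illustration, whereas GAIT learns its GNN parameters (for example with GAT layers) by minimizing a loss. The sketch only shows how per-column class distributions, such as those a single-column predictor would output, can serve as initial node states and be refined by one round of message passing.

```python
import math

# Hypothetical label set; the paper's datasets define their own type vocabularies.
CLASSES = ["capital", "birthplace", "country", "person"]

# Hypothetical, hand-set compatibility weights: COMPAT[a][b] says how strongly a
# neighboring column of type a supports type b. A trained GNN would learn these.
COMPAT = {
    "country": {"capital": 2.0},
    "person": {"birthplace": 2.0},
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def message_passing_step(node_states):
    """One round of mean-aggregated message passing.

    Assumes every column is connected to every other column in the same table.
    Each node keeps its own log-probabilities and adds the averaged support
    it receives from its neighbors' class distributions.
    """
    updated = []
    for i, state in enumerate(node_states):
        logits = [math.log(p + 1e-9) for p in state]
        neighbors = [s for j, s in enumerate(node_states) if j != i]
        for k, target in enumerate(CLASSES):
            support = sum(
                neighbor[c] * COMPAT.get(source, {}).get(target, 0.0)
                for neighbor in neighbors
                for c, source in enumerate(CLASSES)
            )
            if neighbors:
                logits[k] += support / len(neighbors)
        updated.append(softmax(logits))
    return updated


def predict(node_states):
    """Return the most probable label for each column."""
    return [CLASSES[max(range(len(s)), key=s.__getitem__)] for s in node_states]


# A column of city names looks the same to a single-column model in both tables.
ambiguous = [0.45, 0.45, 0.05, 0.05]
table_a = [ambiguous, [0.02, 0.03, 0.90, 0.05]]  # second column: likely "country"
table_b = [ambiguous, [0.02, 0.03, 0.05, 0.90]]  # second column: likely "person"

print(predict(message_passing_step(table_a)))  # ['capital', 'country']
print(predict(message_passing_step(table_b)))  # ['birthplace', 'person']
```

In this toy setup the ambiguous column receives the same initial distribution in both tables, and only the neighboring column's distribution tips the prediction one way or the other, which mirrors the motivation given for Figure 1.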