KGLink: A column type annotation method that combines knowledge graph and pre-trained language model

Yubo Wang, Hao Xin, Lei Chen

TL;DR

KGLink annotates table columns with semantic types by combining WikiData KG information with a pre-trained language model, addressing both the type-granularity gap of KG-based methods and the missing-context problem of deep learning models. It introduces a KG-driven candidate-type extraction module and a multi-task deep learning framework that serializes tables and learns a column-type representation via a Distilled Masked Language Model loss, with the task losses combined through an adaptive loss. Empirical results on SemTab and VizNet show KGLink achieving state-of-the-art or competitive performance, with notable data efficiency (roughly 60% of training data) and robust performance on numeric and non-numeric columns. The approach illustrates the practical value of integrating structured KG signals with pre-trained language models for scalable, accurate semantic annotation of tabular data.

Abstract

The semantic annotation of tabular data plays a crucial role in various downstream tasks. Previous research has proposed knowledge graph (KG)-based and deep learning-based methods, each with its inherent limitations. KG-based methods encounter difficulties annotating columns when there is no match for column cells in the KG. Moreover, KG-based methods can provide multiple predictions for one column, making it challenging to determine the semantic type with the most suitable granularity for the dataset. This type granularity issue limits their scalability. On the other hand, deep learning-based methods face challenges related to the valuable context missing issue. This occurs when the information within the table is insufficient for determining the correct column type. This paper presents KGLink, a method that combines WikiData KG information with a pre-trained deep learning language model for table column annotation, effectively addressing both type granularity and valuable context missing issues. Through comprehensive experiments on widely used tabular datasets encompassing numeric and string columns with varying type granularity, we showcase the effectiveness and efficiency of KGLink. By leveraging the strengths of KGLink, we successfully surmount challenges related to type granularity and valuable context issues, establishing it as a robust solution for the semantic annotation of tabular data.

Paper Structure (17 sections, 17 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: An example from the SemTab dataset. If we only consider the type attribute, we would obtain only Human as the candidate type from the KG. This approach would overlook Cricketer and Cricket, which offer a finer granularity than Human and also provide valuable information for the column type annotation task.
  • Figure 2: Figures (a) and (b) present examples of the type granularity issue and the valuable context missing issue, respectively. In Figure (a), for a column of basketball player names, the column type fetched from the KG could be Athlete or Basketball player. Both could be correct. However, in this dataset, the ground truth type (label) desired for this column could be Name, since no finer-granularity types such as Athlete exist in the dataset labels. A granularity gap therefore exists between the types Athlete or Basketball player and Name. Figure (b) highlights the valuable context missing issue, which makes column annotation difficult for deep learning-based models. The context in columns two and three is irrelevant to annotating column one and provides no information close to the ground truth label, Cricketer. Consequently, it is hard for deep learning-based models to annotate the target column accurately.
  • Figure 3: Overview of KGLink's model structure: KGLink integrates a knowledge graph (KG) component designed to filter out inappropriate numeric or date table cells. It then identifies candidate types $ct_0, \dots, ct_j$ for each column. Following this, the table rows are sorted by their linkage quality. The processed table incorporates labels for candidate-type entities in each column. This table is then serialized by the deep learning component in part 2, where feature vectors for each column are generated using information fetched from the KG in part 1. KGLink introduces the column type representation generation task as a subtask to further enhance prediction performance. This task generates a representation vector for each predicted column based on its cells and KG-extracted information, and aims to minimize the gap between this vector and the representation vector of the column's label in the dataset. The model is trained with a combined adaptive loss with trainable weights $\sigma_0$ and $\sigma_1$. Collectively, these elements contribute to the improved accuracy of column type predictions.
  • Figure 4: Overview of the KG candidate type extraction process: We break down this procedure into three steps. This division is designed to minimize noise from the KG, generate the feature sequence, and optimize the table for enhanced predictions in the subsequent deep learning-based model.
  • Figure 5: An example from the SemTab hassanzadeh_oktie_2019_3518539 dataset, showcasing three KG entity sets: $E_{m_0^0}$, $E_{m_1^0}$, and $E_{m_2^0}$. The red link in the figure represents a connection between entities that are one-hop neighbors. For instance, the entity Rust, which corresponds to an album, has a one-hop neighbor entity, Peter Steele, representing a musician. This connection suggests a higher likelihood that these two entities are the correct matches for the cell mentions Rust and Peter Steele.
  • ...and 5 more figures
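
The one-hop neighbor idea in the Figure 5 caption can be illustrated with a small sketch. This is not the paper's implementation: the function names (`link_scores`, `best_candidates`), the scoring rule (counting neighbor links to other cells' candidates), and the toy candidate labels and neighbor graph are all assumptions made for illustration only.

```python
from typing import Dict, List, Set

def link_scores(
    candidates: Dict[str, List[str]],
    neighbors: Dict[str, Set[str]],
) -> Dict[str, Dict[str, int]]:
    """For each cell mention in one row, count how many candidate entities
    of the *other* mentions are one-hop neighbors of each candidate.
    Edges are treated as undirected (an assumption of this sketch)."""
    scores: Dict[str, Dict[str, int]] = {}
    for mention, entities in candidates.items():
        # Candidate entities belonging to every other mention in the row.
        others = {
            e for m, es in candidates.items() if m != mention for e in es
        }
        scores[mention] = {}
        for entity in entities:
            linked = {
                o for o in others
                if o in neighbors.get(entity, set())
                or entity in neighbors.get(o, set())
            }
            scores[mention][entity] = len(linked)
    return scores

def best_candidates(scores: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """Pick the highest-scoring candidate per mention (first one wins ties)."""
    return {m: max(s, key=s.get) for m, s in scores.items() if s}

# Toy row with two cell mentions; labels are hypothetical placeholders,
# not real KG identifiers.
candidates = {
    "Rust": ["Rust (album)", "Rust (other sense)"],
    "Peter Steele": ["Peter Steele (musician)", "Peter Steele (other person)"],
}
# Hypothetical one-hop neighbor graph: the album links to the musician.
neighbors = {"Rust (album)": {"Peter Steele (musician)"}}

scores = link_scores(candidates, neighbors)
best = best_candidates(scores)
print(best)
```

In this toy row, the album and the musician reinforce each other through their shared link, so both are preferred over the unlinked candidates. A count like this could also serve as a rough per-row linkage-quality signal, although how KGLink actually scores and sorts rows is not specified in the text above.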