SMUTF: Schema Matching Using Generative Tags and Hybrid Features

Yu Zhang; Mei Di; Haozheng Luo; Chenwei Xu; Richard Tzong-Han Tsai

TL;DR

This work tackles large-scale schema matching for open-domain tabular data by introducing SMUTF, a hybrid system that combines HXL-style generative tags, rule-based features, deep embeddings, and an XGBoost predictor to determine column matches. It leverages generative tagging and cross-domain supervision to improve accuracy and efficiency, and introduces the HDXSM dataset to enable realistic evaluation in humanitarian and open-domain settings. Empirical results show SMUTF achieving substantial gains in macro-F1 and macro-AUC across multiple public datasets and the new HDXSM dataset, with ablations confirming the importance of value features and HXL tagging. The approach offers practical benefits for data integration and dataset discovery in diverse domains, and suggests future work in richer tagging, multi-modal SM, and graph-based modeling.

Abstract

We introduce SMUTF (Schema Matching Using Generative Tags and Hybrid Features), a unique approach for large-scale tabular data schema matching (SM), which assumes that supervised learning does not affect performance in open-domain tasks, thereby enabling effective cross-domain matching. This system uniquely combines rule-based feature engineering, pre-trained language models, and generative large language models. In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy "generative tags" for each data column, enhancing the effectiveness of SM. SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models. Recognizing the lack of extensive, publicly available datasets for SM, we have created and open-sourced the HDXSM dataset from the public humanitarian data. We believe this to be the most exhaustive SM dataset currently available. In evaluations across various public datasets and the novel HDXSM dataset, SMUTF demonstrated exceptional performance, surpassing existing state-of-the-art models in terms of accuracy and efficiency, and improving the F1 score by 11.84% and the AUC of ROC by 5.08%. Code is available at https://github.com/fireindark707/Python-Schema-Matching.

Paper Structure (38 sections, 5 equations, 5 figures, 9 tables)

Introduction
Related work
Schema Matching
Text Embedding with PLM
LLM-Based Approaches for Tabular Data
Metadata Generation with LLM
Semantic Type Detection and Table Understanding
SMUTF Methodology
Problem Definitions
Schema Matching Components
HXL-style Tags Generation
Rule-based Feature Extraction
Column Name Features
Value Features
Deep Embedding Similarity
...and 23 more sections

Figures (5)

Figure 1: Schema matching aims to discover relationships between columns.
Figure 2: The basic design of SMUTF comprises two primary elements: the generation of HXL-style tags and the calculation of similarity. Four additional computations are employed for measuring similarity. The outcome of these computations, the similarity score, is then used to predict whether two columns match.
Figure 3: Generating HXL-style tags using the mt0-xl model
Figure 4: Performance of different methods on different schema pairs of WikiData. The assessment metric used in the experiment is the F1 score.
Figure 5: Row Number Impact on Performance of Value-Based Methods
