SMUTF: Schema Matching Using Generative Tags and Hybrid Features
Yu Zhang, Mei Di, Haozheng Luo, Chenwei Xu, Richard Tzong-Han Tsai
TL;DR
This work tackles large-scale schema matching (SM) for open-domain tabular data by introducing SMUTF, a hybrid system that combines HXL-style generative tags, rule-based features, deep embeddings, and an XGBoost predictor to determine column matches. It leverages generative tagging and cross-domain supervision to improve accuracy and efficiency, and introduces the HDXSM dataset to enable realistic evaluation in humanitarian and open-domain settings. Empirical results show SMUTF achieving substantial gains in macro-F1 and macro-AUC across multiple public datasets and the new HDXSM dataset, with ablations confirming the importance of value features and HXL tagging. The approach offers practical benefits for data integration and dataset discovery in diverse domains, and suggests future work in richer tagging, multi-modal SM, and graph-based modeling.
Abstract
We introduce SMUTF (Schema Matching Using Generative Tags and Hybrid Features), an approach for large-scale schema matching (SM) of tabular data. It assumes that supervised learning does not affect performance in open-domain tasks, thereby enabling effective cross-domain matching. The system combines rule-based feature engineering, pre-trained language models, and generative large language models. In an adaptation inspired by the Humanitarian Exchange Language, we deploy "generative tags" for each data column, enhancing the effectiveness of SM. SMUTF is versatile: it works with any pre-trained embeddings, classification methods, and generative models. Recognizing the lack of extensive, publicly available datasets for SM, we have created and open-sourced the HDXSM dataset from public humanitarian data. We believe this to be the most exhaustive SM dataset currently available. In evaluations across various public datasets and the novel HDXSM dataset, SMUTF surpassed existing state-of-the-art models in both accuracy and efficiency, improving the F1 score by 11.84% and the ROC AUC by 5.08%. Code is available at https://github.com/fireindark707/Python-Schema-Matching.
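To make the idea of hybrid features for a column pair concrete, the following is a minimal sketch, not the SMUTF implementation. The function names, the four features chosen, and the hand-written HXL-style tags are all assumptions for illustration; in SMUTF the tags are produced by a generative model, and the resulting features are combined with deep embeddings before being passed to a classifier such as XGBoost.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity of two column names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_jaccard(col_a: list[str], col_b: list[str]) -> float:
    """Jaccard overlap of the distinct, case-folded values of two columns."""
    set_a = {v.strip().lower() for v in col_a}
    set_b = {v.strip().lower() for v in col_b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def numeric_ratio(col: list[str]) -> float:
    """Fraction of values in a column that parse as a float."""
    if not col:
        return 0.0

    def is_number(value: str) -> bool:
        try:
            float(value)
            return True
        except ValueError:
            return False

    return sum(is_number(v) for v in col) / len(col)


def pair_features(name_a, col_a, tag_a, name_b, col_b, tag_b) -> list[float]:
    """Hypothetical feature vector for one candidate column pair.

    The tags are assumed to be given as strings here; this sketch does
    not generate them.
    """
    return [
        name_similarity(name_a, name_b),                   # header similarity
        value_jaccard(col_a, col_b),                       # value overlap
        abs(numeric_ratio(col_a) - numeric_ratio(col_b)),  # type mismatch
        1.0 if tag_a == tag_b else 0.0,                    # tag agreement
    ]


# Illustrative columns with made-up values and hand-assigned tags.
features = pair_features(
    "Province", ["Herat", "Kabul"], "#adm1+name",
    "province_name", ["kabul", "Herat ", "Balkh"], "#adm1+name",
)
print(features)
```

A vector like this would be computed for every candidate column pair and used as the input to a binary match/no-match classifier. Exact tag equality is the simplest possible comparison and is used here only to keep the sketch short.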
