On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing

Jianwei Wang, Kai Wang, Ying Zhang, Wenjie Zhang, Xiwei Xu, Xuemin Lin

TL;DR

This paper addresses missing-data imputation for mixed-type tabular data with UnIMP, an LLM-enhanced framework. UnIMP models a table as a cell-oriented hypergraph and applies BiHMP, a Bidirectional High-order Message-Passing network, to aggregate global-local information and high-order relationships. BiHMP and the Xfusion module together act as adapters for the LLM backbone, and training follows a pre-train/fine-tune pipeline that uses chunking and progressive masking to handle large, heterogeneous tables efficiently. Theoretical analysis and experiments on 10 real-world datasets show UnIMP outperforming state-of-the-art baselines on numerical, categorical, and text imputation, with reported gains in efficiency and generalization.

Abstract

Missing data imputation, which aims to fill in the missing values of raw datasets so that they are complete, is crucial for modern data-driven models like large language models (LLMs) and has attracted increasing interest over the past decades. Despite its importance, existing solutions for missing data imputation either 1) only support numerical and categorical data or 2) show unsatisfactory performance because their design prioritizes text data and lacks key properties for tabular data imputation. In this paper, we propose UnIMP, a Unified IMPutation framework that leverages an LLM and high-order message passing to enhance the imputation of mixed-type data, including numerical, categorical, and text data. Specifically, we first introduce a cell-oriented hypergraph to model the table. We then propose BiHMP, an efficient Bidirectional High-order Message-Passing network, to aggregate global-local information and high-order relationships on the constructed hypergraph while capturing inter-column heterogeneity and intra-column homogeneity. To effectively and efficiently align the capacity of the LLM with the information aggregated by BiHMP, we introduce Xfusion, which, together with BiHMP, acts as an adapter for the LLM. We follow a pre-training and fine-tuning pipeline to train UnIMP, integrating two optimizations: a chunking technique, which divides tables into smaller chunks to enhance efficiency, and a progressive masking technique, which gradually adapts the model to learn more complex data patterns. Both theoretical proofs and empirical experiments on 10 real-world datasets highlight the superiority of UnIMP over existing techniques.
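
To make the abstract's three structural ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each observed cell becomes a node and that each row and each column becomes one hyperedge, that chunking splits a table by rows, and that progressive masking raises the masked fraction linearly over epochs. The function names (`build_cell_hypergraph`, `chunk_rows`, `mask_ratio`, `mask_cells`) and the schedule bounds are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def build_cell_hypergraph(table):
    """Assumed construction: observed cells are nodes; rows/columns are hyperedges."""
    nodes = []                      # node id -> (row index, column index, value)
    hyperedges = defaultdict(list)  # ("row", r) or ("col", c) -> node ids
    for r, row in enumerate(table):
        for c, value in enumerate(row):
            if value is None:       # missing cell: no node is created
                continue
            node_id = len(nodes)
            nodes.append((r, c, value))
            hyperedges[("row", r)].append(node_id)
            hyperedges[("col", c)].append(node_id)
    return nodes, dict(hyperedges)


def chunk_rows(table, chunk_size):
    """Assumed chunking: split the table into consecutive row blocks."""
    return [table[i:i + chunk_size] for i in range(0, len(table), chunk_size)]


def mask_ratio(epoch, total_epochs, start=0.1, end=0.4):
    """Assumed linear schedule: mask few cells early, more cells later."""
    if total_epochs <= 1:
        return end
    frac = epoch / (total_epochs - 1)
    return start + frac * (end - start)


def mask_cells(table, ratio, rng):
    """Hide a fraction of the observed cells; return the masked copy and targets."""
    masked = [list(row) for row in table]
    observed = [(r, c) for r, row in enumerate(table)
                for c, v in enumerate(row) if v is not None]
    targets = rng.sample(observed, int(len(observed) * ratio))
    for r, c in targets:
        masked[r][c] = None
    return masked, targets


# A tiny mixed-type table: numerical, categorical, and text columns.
table = [
    [1.0, "red", "small cup"],
    [None, "blue", "large mug"],
    [3.5, None, "tea pot"],
]
nodes, hyperedges = build_cell_hypergraph(table)
chunks = chunk_rows(table, chunk_size=2)
masked, targets = mask_cells(table, mask_ratio(4, 5), random.Random(0))
```

In this sketch a column hyperedge groups cells of the same type (one plausible reading of intra-column homogeneity), while a row hyperedge groups cells of different types (inter-column heterogeneity). The masked cells in `targets` would serve as training labels for whatever imputation model is placed on top.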

Paper Structure

This paper contains 24 sections, 17 equations, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: Comparisons of Frameworks.
  • Figure 2: Framework overview of UnIMP.
  • Figure 3: Results of different missing mechanisms.
  • Figure 4: Results of downstream classification.
  • Figure 5: Imputation efficiency evaluation (in seconds). Pre-training UnIMP takes 20824 seconds on numerical/categorical data and 73221 seconds on text data; pre-training Table-GPT takes 45381 seconds.
  • ...and 1 more figures