A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models

Ahatsham Hayat, Mohammad Rashedul Hasan

TL;DR

CRILM addresses missing data in tabular datasets by replacing missing entries with missingness-aware textual descriptors produced by pre-trained language models. Large LMs (>10B parameters) generate contextually relevant descriptors, which are used to create a missingness-aware dataset $\mathbf{X_{missingness\_aware}}$. Smaller LMs (<10B parameters) are then fine-tuned on this dataset for downstream classification, which combines zero-shot descriptor generation with cost-efficient transfer learning. Across MCAR, MAR, and MNAR, CRILM outperforms traditional imputation baselines, with particular strength in MNAR settings where biases are most pronounced, achieving gains of up to ~10% on several datasets. The work demonstrates a practical, context-rich approach to data imputation that extends LM capabilities to structured data tasks and suggests broad potential for resource-constrained environments and downstream NLP-integrated analyses.

Abstract

This paper presents a novel approach named Contextually Relevant Imputation leveraging pre-trained Language Models (CRILM) for handling missing data in tabular datasets. Instead of relying on traditional numerical estimations, CRILM uses pre-trained language models (LMs) to create contextually relevant descriptors for missing values. This method aligns datasets with LMs' strengths, allowing large LMs to generate these descriptors and small LMs to be fine-tuned on the enriched datasets for enhanced downstream task performance. Our evaluations demonstrate CRILM's superior performance and robustness across MCAR, MAR, and challenging MNAR scenarios, with up to a 10% improvement over the best-performing baselines. By mitigating biases, particularly in MNAR settings, CRILM improves downstream task performance and offers a cost-effective solution for resource-constrained environments.
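
To make the core idea concrete, the following is a minimal sketch of how a tabular row might be turned into a missingness-aware text description. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sentence template, the feature names, and the hard-coded descriptors are all hypothetical. In CRILM the feature-specific descriptors would be generated by a large LM; here they are placeholders, with the generic phrase "Value not recorded" used as a fallback.

```python
import math

# Hypothetical feature-specific descriptors. In CRILM these would come from a
# large LM; they are hard-coded here purely for illustration.
DESCRIPTORS = {
    "age": "Age was not recorded for this patient",
    "cholesterol": "Cholesterol level was not measured",
}


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def row_to_text(row, descriptors, generic="Value not recorded"):
    """Render one row as text, using a descriptor in place of each missing value.

    The sentence template is an assumption, not the paper's exact wording.
    """
    parts = []
    for feature, value in row.items():
        if is_missing(value):
            # Prefer a feature-specific descriptor; fall back to a generic one.
            parts.append(descriptors.get(feature, generic))
        else:
            parts.append(f"The {feature} is {value}")
    return ". ".join(parts) + "."


row = {"age": 54, "cholesterol": float("nan"), "resting_bp": 130}
print(row_to_text(row, DESCRIPTORS))
```

Applying such a function to every row would yield a text dataset on which a smaller LM could be fine-tuned for the downstream classification task, with no numerical estimate ever substituted for the missing value.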

Paper Structure (18 sections, 3 figures, 6 tables)

This paper contains 18 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An overview of CRILM.
  • Figure 2: [RQ1]: Comparison of CRILM and baseline imputation methods across MCAR, MAR, and MNAR missingness patterns using Llama and FLAN-T5 models. Evaluation involves post-imputation LM-based downstream task performance, with CRILM fine-tuned on missingness-aware contextual datasets and baseline methods on contextual datasets. "No Imputation" cases show LM performance on complete datasets without missing values.
  • Figure 3: [RQ2]: Impact of feature-specific vs. generic ("NaN", "Missing value", and "Value not recorded") missingness descriptors on LM performance in the MCAR scenario.
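
The three missingness patterns compared in the figures differ in what drives a value being missing: under MCAR it is independent of the data, under MAR it depends on other observed features, and under MNAR it depends on the unobserved value itself. The sketch below shows one simple way such masks could be simulated. How the paper induces missingness is not stated in this section, so the function name, the top-fraction thresholding rule, and the default rate are all assumptions made for illustration.

```python
import random


def make_mask(rows, target, mechanism, rate=0.3, driver=None, seed=0):
    """Return a list of booleans; True marks `target` as missing in that row.

    Illustrative rule (an assumption): for MAR and MNAR, roughly the top
    `rate` fraction of rows, ranked by the controlling feature, is masked.
    """
    if mechanism == "MCAR":
        # Missingness is independent of every feature value.
        rng = random.Random(seed)
        return [rng.random() < rate for _ in rows]
    if mechanism == "MAR":
        if driver is None:
            raise ValueError("MAR needs an observed `driver` feature")
        key = driver  # depends on another, fully observed feature
    elif mechanism == "MNAR":
        key = target  # depends on the value that goes missing
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    values = sorted(r[key] for r in rows)
    k = max(1, round(len(rows) * rate))
    cutoff = values[-k]
    return [r[key] >= cutoff for r in rows]


rows = [{"a": i, "b": 10 - i} for i in range(10)]
mnar = make_mask(rows, "a", "MNAR")             # masks the largest values of "a"
mar = make_mask(rows, "a", "MAR", driver="b")   # masks "a" where "b" is largest
mcar = make_mask(rows, "a", "MCAR")             # masks "a" at random
```

In the MNAR case the masked rows are exactly those with the largest values of the target feature, which is why any method that fills gaps from the observed values alone tends to be biased in that setting.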