LDI: Localized Data Imputation for Text-Rich Tables
Soroush Omidvartehrani, Davood Rafiei
TL;DR
This work tackles missing-value imputation in text-rich tables, where dependencies are implicit and data are heterogeneous. It introduces LDI, a framework that selects, for each missing value, a localized subset of informative attributes and representative tuples and uses it to condition a Large Language Model, making imputations accurate, scalable, and interpretable. Experiments on four real-world datasets show that LDI's attribute and tuple selection yields state-of-the-art accuracy, better explainability, and robustness to varying levels of missingness; the approach also works with small local LLMs, which underscores its practicality. Code and data are available to support reproducibility and adoption in real-world data management tasks, and the work identifies extending dependency discovery and explainability as directions for future work.
Abstract
Missing values are pervasive in real-world tabular data and can significantly impair downstream analysis. Imputing them is especially challenging in text-rich tables, where dependencies are implicit, complex, and dispersed across long textual fields. Recent work has explored using Large Language Models (LLMs) for data imputation, yet existing approaches typically process entire tables or loosely related contexts, which can compromise accuracy, scalability, and explainability. We introduce LDI, a novel framework that leverages LLMs through localized reasoning, selecting a compact, contextually relevant subset of attributes and tuples for each missing value. This targeted selection reduces noise, improves scalability, and provides transparent attribution by revealing which data influenced each prediction. Through extensive experiments on real and synthetic datasets, we demonstrate that LDI consistently outperforms state-of-the-art imputation methods, achieving up to 8% higher accuracy with hosted LLMs and even greater gains with local models. The improved interpretability and robustness also make LDI well-suited for high-stakes data management applications.
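To make the idea of localized selection concrete, the following is a minimal sketch of the select-then-prompt pattern the abstract describes. It is not LDI's actual algorithm: the abstract does not state the selection criteria, so the sketch assumes token-overlap (Jaccard) similarity for tuple selection and a leave-one-out nearest-neighbour score for attribute selection. All function names, parameters, and the toy table are hypothetical, and the LLM call itself is omitted.

```python
def tokens(text):
    """Lowercased whitespace tokens of a cell; empty set for a missing cell."""
    return set() if text is None else set(str(text).lower().split())

def jaccard(a, b):
    """Token-overlap similarity between two cells (assumed stand-in metric)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def attribute_score(rows, attr, target):
    """Leave-one-out check: does the most similar row on `attr` share the target value?"""
    complete = [r for r in rows
                if r.get(target) is not None and r.get(attr) is not None]
    if len(complete) < 2:
        return 0.0
    hits = 0
    for i, row in enumerate(complete):
        others = [o for j, o in enumerate(complete) if j != i]
        nearest = max(others, key=lambda o: jaccard(row[attr], o[attr]))
        hits += nearest[target] == row[target]
    return hits / len(complete)

def select_attributes(rows, target, k=2):
    """Keep the k attributes that look most informative for `target`."""
    candidates = [a for a in rows[0] if a != target]
    ranked = sorted(candidates,
                    key=lambda a: attribute_score(rows, a, target),
                    reverse=True)
    return ranked[:k]

def select_tuples(rows, query, attrs, target, m=2):
    """Keep the m complete rows most similar to the query on the selected attributes."""
    complete = [r for r in rows if r.get(target) is not None]
    def similarity(r):
        return sum(jaccard(query[a], r[a]) for a in attrs) / len(attrs)
    return sorted(complete, key=similarity, reverse=True)[:m]

def build_prompt(query, examples, attrs, target):
    """Assemble a few-shot prompt from the localized context only."""
    lines = [f"Fill in the missing '{target}' value."]
    for ex in examples:
        fields = "; ".join(f"{a}: {ex[a]}" for a in attrs)
        lines.append(f"{fields} -> {target}: {ex[target]}")
    fields = "; ".join(f"{a}: {query[a]}" for a in attrs)
    lines.append(f"{fields} -> {target}:")
    return "\n".join(lines)

# Toy table (hypothetical data, for illustration only).
rows = [
    {"id": "r1", "description": "wireless noise cancelling headphones", "category": "audio"},
    {"id": "r2", "description": "bluetooth wireless speaker portable", "category": "audio"},
    {"id": "r3", "description": "stainless steel chef knife", "category": "kitchen"},
    {"id": "r4", "description": "nonstick steel frying pan", "category": "kitchen"},
]
query = {"id": "r5", "description": "wireless earbuds with noise cancelling", "category": None}

attrs = select_attributes(rows, "category", k=1)
examples = select_tuples(rows, query, attrs, "category", m=2)
prompt = build_prompt(query, examples, attrs, "category")
```

Because the prompt is built only from the selected attributes and tuples, the rows in `examples` double as a record of which data could have influenced the prediction, which is the kind of attribution the abstract refers to.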
