TabImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer

Jacob Feitelberg; Dwaipayan Saha; Kyuseong Choi; Zaid Ahmad; Anish Agarwal; Raaz Dwivedi

TL;DR

Missing data in tabular datasets hampers downstream analysis, and no single imputation method works reliably across real-world domains. The authors introduce TabImpute, a pre-trained transformer built on TabPFN that produces zero-shot imputations with no fitting or hyperparameter tuning at inference time; it relies on an entry-wise featurization and a synthetic-data training pipeline. They also introduce MissBench, a benchmark with 42 OpenML datasets and 13 missingness patterns, and propose TabImpute+, an adaptive ensemble of TabImpute and EWF-TabPFN that achieves state-of-the-art imputation accuracy across patterns. The work demonstrates strong generalization to diverse domains and provides open-source code and models to support broader adoption and future extensions.

Abstract

Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, due to huge variance in performance across real-world domains and time-consuming hyperparameter tuning, no default imputation method exists. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations requiring no fitting or hyperparameter tuning at inference time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a $100\times$ speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluation of imputation methods with $42$ OpenML datasets and $13$ missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute's robust performance compared to $11$ established imputation methods.
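
To make the notion of a missingness pattern concrete, the following is a minimal sketch of how an imputation method might be scored under two patterns. It is not the TabImpute or MissBench code: the function names, the two mask generators, and the synthetic Gaussian table are all assumptions chosen for illustration, and the baseline is the simple column averaging mentioned above. One mask drops entries completely at random (MCAR); the other is a self-masking pattern in which larger values are more likely to be missing, a simple instance of MNAR.

```python
import math
import random
import statistics


def mcar_mask(n_rows, n_cols, p, rng):
    """True marks a missing entry; each entry is dropped independently with probability p."""
    return [[rng.random() < p for _ in range(n_cols)] for _ in range(n_rows)]


def self_masking_mask(X, p_low, p_high, rng):
    """Illustrative MNAR pattern: values above the column median are dropped
    with probability p_high, all other values with probability p_low."""
    medians = [statistics.median(col) for col in zip(*X)]
    return [
        [rng.random() < (p_high if x > medians[j] else p_low) for j, x in enumerate(row)]
        for row in X
    ]


def column_mean_impute(X, mask):
    """Baseline: fill each missing entry with the mean of the observed entries in its column."""
    n_cols = len(X[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row, m in zip(X, mask) if not m[j]]
        # Fall back to 0.0 if a column has no observed entries.
        means.append(statistics.fmean(observed) if observed else 0.0)
    return [
        [means[j] if m[j] else row[j] for j in range(n_cols)]
        for row, m in zip(X, mask)
    ]


def rmse_on_missing(X_true, X_hat, mask):
    """Root-mean-squared error computed over the masked entries only."""
    sq = [
        (t - h) ** 2
        for rt, rh, rm in zip(X_true, X_hat, mask)
        for t, h, m in zip(rt, rh, rm)
        if m
    ]
    return math.sqrt(statistics.fmean(sq)) if sq else 0.0


# Synthetic table (assumption): 200 rows, 5 standard-normal columns.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]

masks = {
    "MCAR": mcar_mask(200, 5, 0.3, rng),
    "self-masking MNAR": self_masking_mask(X, 0.1, 0.5, rng),
}
for name, mask in masks.items():
    X_hat = column_mean_impute(X, mask)
    print(f"{name}: RMSE on masked entries = {rmse_on_missing(X, X_hat, mask):.3f}")
```

Column-mean imputation tends to be biased when missingness depends on the value itself, because the observed entries no longer represent the full column. This is one reason to evaluate imputation methods under several patterns rather than MCAR alone.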
