TabImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer

Jacob Feitelberg, Dwaipayan Saha, Kyuseong Choi, Zaid Ahmad, Anish Agarwal, Raaz Dwivedi

TL;DR

Missing data in tabular datasets hampers downstream analysis, and no single imputation method performs reliably across real-world domains. The authors introduce TabImpute, a pre-trained transformer built on TabPFN that provides zero-shot imputations without fitting or tuning, using an entry-wise featurization and a synthetic-data training pipeline. They also introduce MissBench, a benchmark with 42 OpenML datasets and 13 missingness patterns, and propose TabImpute+, an adaptive ensemble of TabImpute and EWF-TabPFN (TabPFN's model applied to the entry-wise features) that is reported to achieve state-of-the-art imputation accuracy across patterns. The work demonstrates strong generalization to diverse domains and provides open-source code and models to support broader adoption and future extensions.

Abstract

Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, due to huge variance in performance across real-world domains and time-consuming hyperparameter tuning, no default imputation method exists. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations requiring no fitting or hyperparameter tuning at inference time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a $100\times$ speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluation of imputation methods with $42$ OpenML datasets and $13$ missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute's robust performance compared to $11$ established imputation methods.

Paper Structure

This paper contains 64 sections, 13 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Evaluation on real-world OpenML data: MissBench. We compare TabImpute and TabImpute+ (ensembled method) with $11$ other popular methods on MissBench. In panel (a), we plot the imputation accuracy (defined as 1 - normalized RMSE), which is calculated for each method, normalized within a dataset, and averaged across datasets and $13$ missingness patterns. Error bars indicate 95% confidence intervals. In panel (b), we compare the runtime per table entry. Any method not labeled (GPU) is tested on a CPU because that method is not GPU-compatible. TabPFN on CPU is significantly slower, so we do not include it. See \ref{sec:arch} for our exact computing specifications and \ref{sec:empirical} for accuracy score methodology.
  • Figure 2: Selection of synthetic missingness patterns implemented in MissBench. Blue entries indicate observed values, and gray entries are unobserved.
  • Figure 3: Overview of our contributions. The first row demonstrates TabPFN's imputation method, which performs iterative column-by-column imputation. We build on this by introducing an entry-wise featurization, as shown in the second row. We create a new synthetic data-generator for missingness data to train our model, TabImpute, shown in green (\ref{sec:syn-data} and \ref{sec:training}, respectively). Lastly, we ensemble TabImpute with TabPFN's model using our features to create TabImpute+ (\ref{sec:ensemble}). We adaptively evaluate all the imputers on the comprehensive and rich set of OpenML datasets with many missingness patterns applied (\ref{sec:empirical}).
  • Figure 4: Imputation accuracy versus fraction of missingness for MCAR. TabImpute+ performs the best when missingness is higher because it is a generative model that fits to the data in context.
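
The accuracy score in the Figure 1 caption (1 - normalized RMSE, normalized within a dataset) and the MCAR setting of Figure 4 can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes that "normalized within a dataset" means min-max normalization of RMSE across the compared methods, which the captions do not specify. The function names, the toy data, and the two baseline imputers (column mean and zero fill) are hypothetical stand-ins chosen only for illustration.

```python
import numpy as np


def mcar_mask(shape, missing_frac, rng):
    """Boolean mask for MCAR missingness; True marks an observed entry."""
    return rng.random(shape) >= missing_frac


def column_mean_impute(X, observed):
    """Baseline imputer: fill unobserved entries with the observed column mean."""
    X_masked = np.where(observed, X, np.nan)
    col_means = np.nanmean(X_masked, axis=0)
    # Fall back to 0 for a column with no observed entries at all.
    col_means = np.where(np.isnan(col_means), 0.0, col_means)
    return np.where(observed, X, col_means)


def zero_impute(X, observed):
    """Baseline imputer: fill unobserved entries with zero."""
    return np.where(observed, X, 0.0)


def rmse_on_missing(X_true, X_imputed, observed):
    """RMSE computed only over the entries that were hidden."""
    missing = ~observed
    return float(np.sqrt(np.mean((X_true[missing] - X_imputed[missing]) ** 2)))


def accuracy_scores(rmse_by_method):
    """1 - normalized RMSE for one dataset.

    ASSUMPTION: min-max normalization across methods (best -> 1, worst -> 0).
    """
    values = np.array(list(rmse_by_method.values()), dtype=float)
    lo, span = values.min(), values.max() - values.min()
    if span == 0.0:
        # All methods tie, so every method gets the top score.
        return {name: 1.0 for name in rmse_by_method}
    return {name: 1.0 - (v - lo) / span for name, v in rmse_by_method.items()}


# Toy dataset standing in for one benchmark table (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(200, 5))
observed = mcar_mask(X.shape, missing_frac=0.3, rng=rng)

rmse_by_method = {
    "column_mean": rmse_on_missing(X, column_mean_impute(X, observed), observed),
    "zero": rmse_on_missing(X, zero_impute(X, observed), observed),
}
scores = accuracy_scores(rmse_by_method)
print(scores)
```

Under this assumed normalization, the best method on a dataset scores 1 and the worst scores 0, so averaging the scores across datasets and missingness patterns weights each dataset equally regardless of the scale of its values.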