TabINR: An Implicit Neural Representation Framework for Tabular Data Imputation

Vincent Ochs, Florentin Bieder, Sidaty el Hadramy, Paul Friedrich, Stephanie Taha-Mehlitz, Anas Taha, Philippe C. Cattin

TL;DR

Missing values in tabular data hamper predictive modeling, especially with heterogeneous feature types and limited samples. The authors propose TabINR, an implicit neural representation framework that treats the table as a neural function $\hat{D}_{ij}=f_\theta(\lambda_i,c_j)$ of a learnable row embedding $\lambda_i$ and a learnable feature embedding $c_j$, with test-time latent optimization to personalize imputations for unseen rows. Compared against classical and deep baselines across twelve real-world datasets and multiple missingness mechanisms, TabINR achieves competitive or superior reconstruction accuracy, particularly in high-dimensional settings, while offering fast inference and a simple, memory-efficient architecture. The work presents INR-based representations as a unified paradigm for tabular learning and points to future extensions for non-random missingness, larger scales, and multimodal data integration.

Abstract

Tabular data forms the basis for a wide range of applications, yet real-world datasets are frequently incomplete due to collection errors, privacy restrictions, or sensor failures. Missing values degrade the performance or hinder the applicability of downstream models, and simple imputation strategies tend to introduce bias or distort the underlying data distribution. We therefore require imputers that provide high-quality imputations, are robust across dataset sizes, and yield fast inference. We introduce TabINR, an auto-decoder-based Implicit Neural Representation (INR) framework that models tables as neural functions. Building on recent advances in generalizable INRs, we introduce learnable row and feature embeddings that effectively deal with the discrete structure of tabular data and can be inferred from partial observations, enabling instance-adaptive imputations without modifying the trained model. We evaluate our framework across a diverse range of twelve real-world datasets and multiple missingness mechanisms, demonstrating consistently strong imputation accuracy, mostly matching or outperforming classical (KNN, MICE, MissForest) and deep-learning-based models (GAIN, ReMasker), with the clearest gains on high-dimensional datasets.
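
The auto-decoder idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: the one-hidden-layer tanh network, the embedding size, plain gradient descent, the synthetic table, and all function names (`init_params`, `gd_step`, `fit`, `impute_new_row`) are assumptions made for illustration. It only shows the two phases described above: joint training of $f_\theta$, the row embeddings, and the feature embeddings on observed cells, followed by test-time optimization of a single new row embedding with everything else frozen.

```python
import numpy as np

def init_params(n_rows, n_cols, d=4, hidden=16, seed=0):
    """Row embeddings L, feature embeddings C, and a one-hidden-layer MLP (assumed sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "L": 0.1 * rng.standard_normal((n_rows, d)),
        "C": 0.1 * rng.standard_normal((n_cols, d)),
        "W1": rng.standard_normal((2 * d, hidden)) / np.sqrt(2 * d),
        "b1": np.zeros(hidden),
        "w2": rng.standard_normal(hidden) / np.sqrt(hidden),
        "b2": 0.0,
    }

def predict(p, lam, rows, cols):
    """Evaluate f_theta(lambda_i, c_j) for the index pairs (rows[k], cols[k])."""
    z = np.concatenate([lam[rows], p["C"][cols]], axis=1)
    h = np.tanh(z @ p["W1"] + p["b1"])
    return z, h, h @ p["w2"] + p["b2"]

def gd_step(p, lam, rows, cols, y, lr, update_shared):
    """One gradient-descent step on the MSE over observed cells; returns the pre-step loss."""
    z, h, pred = predict(p, lam, rows, cols)
    d = lam.shape[1]
    err = 2.0 * (pred - y) / len(y)               # dLoss/dpred
    dh = err[:, None] * p["w2"] * (1.0 - h ** 2)  # backprop through tanh
    dz = dh @ p["W1"].T                           # gradient w.r.t. [lambda_i, c_j]
    g_lam = np.zeros_like(lam)
    np.add.at(g_lam, rows, dz[:, :d])             # accumulate per row embedding
    if update_shared:                             # training phase only
        g_C = np.zeros_like(p["C"])
        np.add.at(g_C, cols, dz[:, d:])
        g_W1, g_b1 = z.T @ dh, dh.sum(axis=0)
        g_w2, g_b2 = h.T @ err, err.sum()
        p["C"] -= lr * g_C
        p["W1"] -= lr * g_W1
        p["b1"] -= lr * g_b1
        p["w2"] -= lr * g_w2
        p["b2"] -= lr * g_b2
    lam -= lr * g_lam                             # in-place update of the embeddings
    return float(np.mean((pred - y) ** 2))

def fit(X, observed, steps=500, lr=0.02):
    """Jointly optimize network, row embeddings, and feature embeddings on observed cells."""
    rows, cols = np.nonzero(observed)
    p = init_params(*X.shape)
    losses = [gd_step(p, p["L"], rows, cols, X[rows, cols], lr, True) for _ in range(steps)]
    return p, losses

def impute_new_row(p, x_new, observed_new, steps=200, lr=0.05):
    """Test-time latent optimization: only lambda_new changes; network and C stay fixed."""
    cols = np.nonzero(observed_new)[0]
    rows = np.zeros(len(cols), dtype=int)
    lam_new = np.zeros((1, p["L"].shape[1]))
    losses = [gd_step(p, lam_new, rows, cols, x_new[cols], lr, False) for _ in range(steps)]
    all_cols = np.arange(len(x_new))
    full = predict(p, lam_new, np.zeros(len(x_new), dtype=int), all_cols)[2]
    return np.where(observed_new, x_new, full), losses

# Synthetic low-rank table with roughly 20% of cells hidden (illustrative data only).
rng = np.random.default_rng(1)
X = np.outer(rng.standard_normal(40), np.linspace(0.5, 1.5, 6)) + 0.2 * np.arange(6)
observed = rng.random(X.shape) > 0.2
p, train_losses = fit(X, observed)

x_new = 0.7 * np.linspace(0.5, 1.5, 6) + 0.2 * np.arange(6)
obs_new = np.array([True, True, False, True, False, True])
x_hat, new_losses = impute_new_row(p, x_new, obs_new)
```

Because `impute_new_row` calls `gd_step` with `update_shared=False`, adding an unseen row costs only a few gradient steps on a vector of size `d`. A practical implementation would more likely rely on an autodiff framework and an adaptive optimizer than on the hand-written gradients used here.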

Paper Structure

This paper contains 19 sections, 6 equations, 27 figures, 5 tables.

Figures (27)

  • Figure 1: Directed acyclic graphs illustrating common missing-data mechanisms for a partially observed variable $X$. White circles represent fully observed variables, gray circles represent unobserved or partially observed components, and squares denote the missingness indicators $R_X$. Arrows indicate causal or probabilistic dependence.
  • Figure 2: The proposed TabINR framework. During training, we jointly optimize the network $f_\theta$, the row embeddings $\Lambda$, as well as feature embeddings $C$. Once the network is trained, new instances can be added by only optimizing a new row embedding $\lambda_{\text{new}}$, keeping $f_\theta$ and $C$ fixed.
  • Figure 3: Overall performance of TabINR and six baselines on 12 benchmark datasets under MCAR with a missingness ratio of 0.1. The results are shown as the mean and standard deviation of RMSE and AUROC scores (AUROC is only applicable to datasets with classification tasks).
  • Figure 4: Inference-time comparison across imputers (lower is better). Bars show mean seconds per dataset over 5 runs; error bars denote $\pm 1\,\mathrm{SD}$. While Mean/Mode is trivially fastest, TabINR and ReMasker achieve sub-0.25 s inference once trained, whereas iterative baselines (KNN, MICE, MissForest) are markedly slower on higher-dimensional datasets. Results shown for MAR with $p_{\text{miss}}=0.3$; trends are consistent under MCAR/MNAR.
  • Figure 5: Sensitivity analysis of TabINR on the letter dataset under MCAR scenarios. The results are shown in terms of RMSE and AUROC, with the scores measured with respect to (a) the dataset size, (b) the number of features, and (c) the missingness ratio. The default setting is as follows: dataset size = 20000, number of features = 16, and missingness ratio = 0.1.
  • ...and 22 more figures
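
The captions above refer to the MCAR, MAR, and MNAR mechanisms and to RMSE as the reconstruction metric. The sketch below shows one common way to simulate such masks and to score an imputer on the hidden cells only. The logistic missingness model, the choice of column 0 as the fully observed driver for MAR, and all function names are assumptions for illustration; the listing does not specify the protocol the paper actually uses. Mean imputation stands in for the Mean/Mode baseline.

```python
import numpy as np

def mcar_mask(shape, p_miss, rng):
    """MCAR: every cell is dropped independently with probability p_miss."""
    return rng.random(shape) < p_miss

def mar_mask(X, p_miss, rng):
    """MAR (assumed form): column 0 stays observed and drives missingness elsewhere."""
    driver = (X[:, 0] - X[:, 0].mean()) / (X[:, 0].std() + 1e-12)
    # 2 * sigmoid(.) averages to roughly 1, so the overall rate is roughly p_miss.
    prob = np.clip(p_miss * 2.0 / (1.0 + np.exp(-driver)), 0.0, 1.0)
    mask = rng.random(X.shape) < prob[:, None]
    mask[:, 0] = False
    return mask

def mnar_mask(X, p_miss, rng):
    """MNAR (assumed form): a cell's own value sets its chance of being missing."""
    zscore = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    prob = np.clip(p_miss * 2.0 / (1.0 + np.exp(-zscore)), 0.0, 1.0)
    return rng.random(X.shape) < prob

def mean_impute(X, mask):
    """Fill hidden cells with the column mean computed from observed cells only."""
    X_obs = np.where(mask, np.nan, X)
    col_means = np.nanmean(X_obs, axis=0)
    return np.where(mask, col_means, X)

def masked_rmse(X_true, X_imputed, mask):
    """RMSE restricted to the cells that were hidden from the imputer."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic standard-normal table; a missingness ratio of 0.1 mirrors the Figure 3 setting.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
mask = mcar_mask(X.shape, 0.1, rng)
rmse_mean = masked_rmse(X, mean_impute(X, mask), mask)
mask_mar = mar_mask(X, 0.3, rng)
mask_mnar = mnar_mask(X, 0.3, rng)
```

Scoring only the masked cells keeps the metric from being diluted by entries the imputer was given, which is why `masked_rmse` indexes with `mask` before averaging.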