Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets

Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi

TL;DR

Not Another Imputation Method (NAIM) handles missing values in tabular data without traditional imputation. It introduces a transformer-based encoder with feature-specific embeddings for both categorical and numerical features and a novel masked self-attention mechanism that completely excludes missing entries from the attention computation, augmented by a regularization technique that simulates missing data during training. Across five public tabular datasets, NAIM outperforms a wide range of ML and DL baselines, including models paired with three imputers and other DL architectures that handle missing data intrinsically. The work reports improved predictive performance and resilience to missingness, and provides open-source code to foster adoption and further research.

Abstract

Handling missing values in tabular datasets presents a significant challenge in training and testing artificial intelligence models, an issue usually addressed using imputation techniques. Here we introduce "Not Another Imputation Method" (NAIM), a novel transformer-based model specifically designed to address this issue without the need for traditional imputation techniques. NAIM's ability to avoid imputing missing values and to learn effectively from the available data relies on two main techniques: feature-specific embeddings that encode both categorical and numerical features while also handling missing inputs, and a modified masked self-attention mechanism that completely masks out the contributions of missing data. Additionally, a novel regularization technique is introduced to enhance the model's generalization capability from incomplete data. We extensively evaluated NAIM on 5 publicly available tabular datasets, demonstrating its superior performance over 6 state-of-the-art machine learning models and 5 deep learning models, each paired with 3 different imputation techniques when necessary. The results highlight the efficacy of NAIM in improving predictive performance and resilience in the presence of missing data. To facilitate further research and practical application in handling missing data without traditional imputation methods, we made the code for NAIM available at https://github.com/cosbidev/NAIM.
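
The idea of masking out missing data in self-attention can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (see the linked repository for that); it assumes one plausible realization in which missing features are hidden as keys before the softmax and their query rows are then zeroed afterwards, so that no attention weight involves a missing feature in either direction. The function name `masked_self_attention` and the boolean `missing` list are hypothetical names chosen for illustration.

```python
import math

def masked_self_attention(Q, K, V, missing):
    """Single-head attention over n feature tokens.

    Q, K, V: lists of n vectors (lists of floats).
    missing: list of n booleans, True where the feature is missing.
    Returns (A, out): the n x n attention weights and the n output vectors.
    """
    n, d = len(Q), len(Q[0])
    A, out = [], []
    for i in range(n):
        if missing[i]:
            # Assumed second masking step: a missing feature's query row
            # is zeroed, so it produces no output at all.
            A.append([0.0] * n)
            out.append([0.0] * len(V[0]))
            continue
        # Classic masking: missing keys receive a score of -inf.
        scores = [
            -math.inf if missing[j]
            else sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            for j in range(n)
        ]
        top = max(scores)  # finite, because key i itself is observed
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        A.append(weights)
        # Weighted sum of the observed value vectors only.
        out.append([
            sum(w * V[j][c] for j, w in enumerate(weights) if not missing[j])
            for c in range(len(V[0]))
        ])
    return A, out

# Three feature tokens; the second one is missing.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
missing = [False, True, False]
A, out = masked_self_attention(X, X, X, missing)
```

In this sketch both the row and the column of the missing feature in `A` are exactly zero, while the rows of observed features still sum to one over the observed keys.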

Paper Structure (21 sections, 17 equations, 15 figures, 9 tables)

Figures (15)

  • Figure 1: The architecture of NAIM, composed of the Feature Embedding, the $\mathit{Encoder}$ equipped with the Masked Multi-Head Attention mechanism, and the final classification head [bib:transformer].
  • Figure 2: The proposed Feature Embedding process for tabular data. In the example, the feature vector $x$ has $4$ features: $2$ categorical ($x^{cat}$) and $2$ numerical ($x^{num}$). In the figure, colors and shapes exemplify possible values for the first categorical feature $x_1^{cat}$ and for the second one $x_2^{cat}$, respectively, whilst dedicated symbols denote a non-missing numerical feature, its value, and the padding index associated with missing features of either type. In the Feature Embedding block, the embedding $e$ of the feature vector $x$ is the concatenation of the embedded representations of the categorical and numerical features, denoted as $e^{cat}$ and $e^{num}$, respectively. Each of these representations is in turn the concatenation of the vectors associated with each feature value, selected using the feature-specific lookup tables $E_i^{cat}$ and $E_i^{num}$.
  • Figure 3: The proposed masked self-attention mechanism, designed to effectively ignore the impact of missing entries within the attention matrix. The $QK^T$ matrix, obtained by multiplying the $Q$ and $K$ representations, illustrates how the contributions of the different features, identified by different colors, mix together. Next, the classic masked self-attention mechanism is applied, and some of the contributions of the missing features (marked with dedicated symbols in the figure) remain. Then, the proposed attention mechanism ensures that the influence of these missing values is completely masked out before the multiplication by the representation $V$ of the sample.
  • Figure 3: Average AUC performance and standard error (in brackets) of the experiments across the $5$ different datasets. To facilitate the analysis, the best performance in each column is highlighted in bold.
  • Figure 4: The proposed regularization strategy, performed at every epoch before feeding the sample to the model. In the figure, colors and shapes exemplify possible values for the categorical features, whilst a dedicated symbol stands for a numerical feature. The example reports $3$ feature vectors and their masked versions created in $5$ different epochs. Note that, when some features are originally missing, only the non-missing entries can be masked.
  • ...and 10 more figures
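
The regularization strategy described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the `MISSING` marker, the function name `random_mask`, the uniform choice of how many entries to hide, and the rule that at least one observed entry is always kept are all assumptions made for the example. What follows the caption directly is that a fresh mask is drawn at every epoch and that only non-missing entries can be masked.

```python
import random

MISSING = None  # hypothetical marker for a missing entry

def random_mask(sample, rng):
    """Return a copy of `sample` with a random subset of observed entries hidden.

    Entries that are already missing stay missing. Assumption: at least one
    observed entry is always kept, so the model never sees an empty sample.
    """
    observed = [i for i, v in enumerate(sample) if v is not MISSING]
    if len(observed) <= 1:
        return list(sample)
    # Number of entries to hide, drawn uniformly from 0 .. len(observed) - 1.
    k = rng.randint(0, len(observed) - 1)
    to_mask = set(rng.sample(observed, k))
    return [MISSING if i in to_mask else v for i, v in enumerate(sample)]

# One feature vector with an originally missing second feature,
# masked independently in 5 simulated epochs.
rng = random.Random(0)
sample = ["red", MISSING, 3.7, "circle"]
views = [random_mask(sample, rng) for _ in range(5)]
```

Each element of `views` is what the model would receive for this sample in one epoch; the original `sample` is left unchanged, so a new mask can be drawn the next time the sample is visited.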