mDAE : modified Denoising AutoEncoder for missing data imputation

Mariette Dupuy, Marie Chavent, Remi Dubois

TL;DR

An ablation study on several UCI Machine Learning Repository datasets shows that the modified loss function and overcomplete structure of mDAE improve the Root Mean Squared Error (RMSE) of reconstruction.

Abstract

This paper introduces a methodology based on the Denoising AutoEncoder (DAE) for missing data imputation. The proposed methodology, hereafter called mDAE, results from a modification of the loss function and a straightforward procedure for choosing the hyper-parameters. An ablation study on several UCI Machine Learning Repository datasets shows the benefit of this modified loss function and of an overcomplete structure in terms of Root Mean Squared Error (RMSE) of reconstruction. The numerical study is completed by a comparison of mDAE with eight other methods (four standard and four more recent). A criterion called Mean Distance to Best (MDB) is proposed to measure how well a method performs overall across all datasets. It is defined as the mean, over the datasets, of the distance between the RMSE of the considered method and the RMSE of the best method. According to this criterion, mDAE was consistently ranked among the top methods (along with SoftImpute and missForest), while the four more recent methods were systematically ranked last. The Python code of the numerical study will be made available on GitHub so that the results can be reproduced or extended to other datasets and methods.
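The Mean Distance to Best criterion described in the abstract can be illustrated with a short NumPy sketch. It assumes the "distance" is the plain difference between a method's RMSE and the lowest RMSE obtained on the same dataset; the function name `mean_distance_to_best` and the RMSE values are hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_distance_to_best(rmse):
    """Return one MDB value per method.

    rmse: array of shape (n_methods, n_datasets), one RMSE per
    (method, dataset) pair. Assumes lower RMSE is better.
    """
    rmse = np.asarray(rmse, dtype=float)
    best = rmse.min(axis=0)            # best RMSE on each dataset
    return (rmse - best).mean(axis=1)  # mean gap to the best, per method

# Hypothetical RMSE table: 3 methods (rows) x 2 datasets (columns).
rmse = np.array([
    [0.50, 0.25],
    [0.75, 0.25],
    [1.00, 0.75],
])
mdb = mean_distance_to_best(rmse)
print(mdb)
```

Under this reading, a method that is the best on every dataset gets an MDB of 0, and larger values indicate a method that is, on average, further from the best result.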

Paper Structure

This paper contains 11 sections, 10 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Scheme of a basic AutoEncoder (AE).
  • Figure 2: Scheme of a denoising AutoEncoder (DAE). Red crosses represent the values in $N({\mathbf x}_i)$ randomly set to 0.
  • Figure 3: Scheme of a DAE directly applied on pre-imputed data. Violet dots in ${\tilde{\mathbf x}}_i$ represent the missing values set to 0. Red crosses in $N({\tilde{\mathbf x}}_i)$ represent the values randomly set to 0.
  • Figure 4: Scheme of an mDAE. Violet dots in ${\tilde{\mathbf x}}_i$ represent the missing values set to 0. Violet dots in ${\tilde{\mathbf z}}_i$ represent the predicted values set to 0. Red crosses in $N({\tilde{\mathbf x}}_i)$ represent the values randomly set to 0.
  • Figure 5: A grid of 6 simple structures where $p$ is the number of units of the input layer.
  • ...and 8 more figures
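The captions of Figures 2 to 4 describe three masking steps: missing values set to 0 in ${\tilde{\mathbf x}}_i$, values randomly set to 0 by the noise $N$, and predicted values set to 0 at the missing positions in ${\tilde{\mathbf z}}_i$. The sketch below illustrates these steps with NumPy. It is only an illustration under stated assumptions: a squared-error loss is assumed, the function names and the `drop_prob` parameter are hypothetical, and the vector `z` stands in for the output of an autoencoder that is not implemented here.

```python
import numpy as np

def pre_impute_zero(x):
    """Set missing entries (NaN) to 0 and return the observed-value mask."""
    mask = ~np.isnan(x)
    return np.where(mask, x, 0.0), mask

def masking_noise(x_tilde, drop_prob, rng):
    """Randomly set a proportion of entries to 0 (the noise N in the captions)."""
    keep = rng.random(x_tilde.shape) >= drop_prob
    return x_tilde * keep

def masked_squared_error(z, x_tilde, mask):
    """Squared error restricted to observed entries.

    Predictions at missing positions are set to 0, matching the 0 placed in
    x_tilde, so those positions contribute nothing to the loss.
    """
    z_tilde = z * mask
    return float(np.sum((z_tilde - x_tilde) ** 2))

rng = np.random.default_rng(0)
x = np.array([1.0, np.nan, 3.0])       # one observation with a missing value
x_tilde, mask = pre_impute_zero(x)     # [1.0, 0.0, 3.0]
noisy = masking_noise(x_tilde, drop_prob=0.5, rng=rng)  # input a DAE would see

z = np.array([1.5, 9.0, 3.0])          # hypothetical network output
loss = masked_squared_error(z, x_tilde, mask)
unmasked = float(np.sum((z - x_tilde) ** 2))
print(loss, unmasked)
```

In this toy case the prediction at the missing position (9.0) would dominate an unmasked loss, whereas the masked version only measures the error on the two observed entries.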