Multiple Imputation Methods under Extreme Values

Enzo Porto Brasil

TL;DR

This study evaluates six multiple imputation (MI) methods from MICE to understand how they perform when missing data co-occur with extreme values. Through Monte Carlo simulations on a three-variable normal design, it compares parametric, donor-based, and nonparametric MI approaches under MCAR, using OLS as the downstream model for clean data and elastic net for contaminated data. Key findings show that parametric MI often yields tighter predictive tails and better out-of-sample CV-MSE, while donor-based and ML approaches can reduce slope bias at the expense of heavier tails; a larger sample size reduces dispersion but does not fully erase contamination-induced distortions. The work provides practical guidance: choose an MI method in light of the missingness, the extremes, and the downstream modeling goals, and examine the missingness mechanism and the presence of extremes before selecting a method. The results have direct implications for applied research where extremes and MCAR/MAR patterns interact with imputation, influencing both predictive performance and inferential validity.

Abstract

Missing data are ubiquitous in empirical databases, yet statistical analyses typically require complete data matrices. Multiple imputation offers a principled solution for filling these gaps. This study evaluates the performance of several multiple imputation methods, both in the presence and absence of extreme values, using the MICE package in R. Through Monte Carlo simulations, we generated incomplete data sets with three variables and assessed each imputation method within regression models. The results indicate that the linear-regression-based imputation method showed the best overall predictive performance (CV-MSE), whereas the sparse-model approach was generally less efficient. Our findings underscore the relevance of extreme values when selecting an imputation strategy and highlight sample size, proportion of missingness, presence of extremes, and the type of fitted model as key determinants of performance. Despite its limitations, the study offers practical recommendations for researchers, stressing the need to examine the missingness mechanism and the occurrence of extreme values before choosing an imputation method.
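
To make the simulation design more concrete, the following is a minimal sketch of how one Monte Carlo replicate could be generated. It is not the authors' code (the study uses the MICE package in R) and relies only on the Python standard library. Several details are assumptions made for illustration: the function name `simulate_dataset`, the equicorrelated structure with a single `rho`, the contamination rule (shifting `y` by `shift` standard deviations in a random fraction `p_ext` of rows), and the choice to apply MCAR missingness only to the two predictors.

```python
import math
import random


def simulate_dataset(n, rho, p_miss, p_ext, shift=5.0, seed=0):
    """Generate one replicate as (complete, observed) lists of [y, x1, x2] rows.

    Hypothetical illustration; the contamination and missingness rules
    are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)

    # Equicorrelated standard normals via a shared factor:
    # v_j = sqrt(rho) * z0 + sqrt(1 - rho) * z_j  =>  corr(v_j, v_k) = rho.
    complete = []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)
        row = [
            math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(3)
        ]
        complete.append(row)

    # Contamination (assumed rule): shift y by +/- `shift` SDs
    # in round(p_ext * n) randomly chosen rows.
    n_ext = round(p_ext * n)
    for i in rng.sample(range(n), n_ext):
        complete[i][0] += shift * rng.choice([-1.0, 1.0])

    # MCAR: each predictor cell is deleted independently with
    # probability p_miss, regardless of any observed or unobserved value.
    observed = [row[:] for row in complete]
    for row in observed:
        for j in (1, 2):
            if rng.random() < p_miss:
                row[j] = None

    return complete, observed


# Example call with design values of the kind reported in the figures.
complete, observed = simulate_dataset(n=500, rho=0.6, p_miss=0.2, p_ext=0.10)
```

The `observed` rows would then be passed to an imputation routine, and the `complete` rows kept as the reference for evaluating predictive error.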

Paper Structure (23 sections, 6 equations, 12 figures, 28 tables, 1 algorithm)

Figures (12)

  • Figure 1: Relationships (correlation plot, marginal histograms and empirical densities) for clean data sets with $n=500$, $P_{\text{ext}}=0.10$, and $\rho=0.6$. Axes are $y, x_1, x_2$. Colorbars ('Level') indicate relative bivariate density per row.
  • Figure 2: Relationships (correlation plot, marginal histograms and empirical densities) for contaminated data sets (with extreme values) with $n=500$, $P_{\text{ext}}=0.10$, and $\rho=0.6$. Axes are $y, x_1, x_2$. Colorbars ('Level') indicate relative bivariate density per row.
  • Figure 3: Predictive MSE densities (clean vs. contaminated data), MSE boxplots, and QQ-plots (predicted vs. true quantiles of $y$) across six MI methods (T1--T6) for $\boldsymbol{n=20}$, ordered by $P_{\text{ext}}$ and $P_{\text{miss}}$ (panel 1 of 4). Clean data are analyzed with OLS and contaminated data with elastic net. Each subpanel shows the design values ($n$, $P_{\text{ext}}$, $P_{\text{miss}}$, iter, n.sim, and $\rho$).
  • Figure 4: Predictive MSE densities (clean vs. contaminated data), MSE boxplots, and QQ-plots (predicted vs. true quantiles of $y$) across six MI methods (T1--T6) for $\boldsymbol{n=20}$, ordered by $P_{\text{ext}}$ and $P_{\text{miss}}$ (panel 2 of 4). Clean data are analyzed with OLS and contaminated data with elastic net. Each subpanel shows the design values ($n$, $P_{\text{ext}}$, $P_{\text{miss}}$, iter, n.sim, and $\rho$).
  • Figure 5: Predictive MSE densities (clean vs. contaminated data), MSE boxplots, and QQ-plots (predicted vs. true quantiles of $y$) across six MI methods (T1--T6) for $\boldsymbol{n=20}$ and $\boldsymbol{n=40}$, ordered by $n$, $P_{\text{ext}}$, and $P_{\text{miss}}$ (panel 3 of 4). Clean data are analyzed with OLS and contaminated data with elastic net. Each subpanel shows the design values ($n$, $P_{\text{ext}}$, $P_{\text{miss}}$, iter, n.sim, and $\rho$).
  • ...and 7 more figures