Bayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data

Amuche Ibenegbu, Pierre Lafaye de Micheaux, Rohitash Chandra

Abstract

Time-series analysis is often affected by missing data, a common problem across several fields, including healthcare and environmental monitoring. Multiple Imputation by Chained Equations (MICE) has been prominent for imputing missing values through "fully conditional specification". We extend MICE using the Bayesian framework (Bayes-MICE), utilising Bayesian inference to impute missing values via Markov Chain Monte Carlo (MCMC) sampling to account for uncertainty in MICE model parameters and imputed values. We also include temporally informed initialisation and time-lagged features in the model to respect the sequential nature of time-series data. We evaluate the Bayes-MICE method using two real-world datasets (AirQuality and PhysioNet), and using both the Random Walk Metropolis (RWM) and the Metropolis-Adjusted Langevin Algorithm (MALA) samplers. Our results demonstrate that Bayes-MICE reduces imputation errors relative to the baseline methods over all variables and accounts for uncertainty in the imputation process, thereby providing a more accurate measure of imputation error. We also found that MALA converges faster than RWM, achieving comparable accuracy while providing more consistent posterior exploration. Overall, these findings suggest that the Bayes-MICE framework represents a practical and efficient approach to time-series imputation, balancing increased accuracy with meaningful quantification of uncertainty in various environmental and clinical settings.

Paper Structure

This paper contains 27 sections, 19 equations, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: Time-lagged imputation mechanism in univariate MICE, where a missing value at time index $t$ is imputed by conditioning on observed neighbouring values from past and future lags.
  • Figure 2: The proposed Bayes-MICE imputation framework. The framework consists of temporal pattern detection, placeholder initialisation, MICE loop with lagged predictors and Bayesian modelling, while MCMC (RWM or MALA) provides posterior sampling and parameter updates. The final imputation is generated via posterior predictive draws.
  • Figure 3: Convergence diagnostics for $\tau^2$ (trace, marginal density, and ACF) across two chains and two samplers for one variable (HCO3) from the PhysioNet dataset.
  • Figure 4: Prediction accuracy comparison for the CO(GT) variable from the AirQuality dataset (top) and the HCO3 variable from the PhysioNet dataset (bottom). Each subplot panel shows: (top-left) predicted versus true values, (top-right) residuals against the true values, (bottom-left) sorted true values with corresponding model predictions, and (bottom-right) box plots of absolute errors.
  • Figure 5: Imputation error over the time index for each method, with the AirQuality CO(GT) variable shown in the top panel and the PhysioNet HCO3 variable shown in the bottom panel.
  • ...and 2 more figures
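
The captions for Figures 1 and 2 describe a pipeline of placeholder initialisation, lagged predictors drawn from past and future neighbours, MCMC sampling, and a posterior predictive draw. The sketch below illustrates that flow for a single variable. It is not the authors' implementation: the function names, the linear-Gaussian model, the N(0, 10^2) priors, the step size, and the use of linear interpolation for placeholders are all assumptions made for illustration. Only the Random Walk Metropolis sampler is shown, and MALA is omitted.

```python
import numpy as np

def lagged_features(x, lags=(1, 2)):
    """Predictors for x[t] from past (t-k) and future (t+k) neighbours.
    Time indices whose window falls off either end of the series are dropped."""
    x = np.asarray(x, dtype=float)
    m = max(lags)
    idx = np.arange(m, len(x) - m)
    cols = [x[idx - k] for k in lags] + [x[idx + k] for k in lags]
    return idx, np.column_stack(cols)

def log_posterior(theta, X, y):
    """Gaussian likelihood; assumed N(0, 10^2) priors on coefficients and log(sigma)."""
    beta, log_sigma = theta[:-1], theta[-1]
    resid = y - X @ beta
    loglik = -0.5 * np.sum(resid**2) / np.exp(2 * log_sigma) - len(y) * log_sigma
    logprior = -0.5 * np.sum(theta**2) / 100.0
    return loglik + logprior

def rwm(X, y, n_samples=2000, step=0.05, seed=0):
    """Random Walk Metropolis with an isotropic Gaussian proposal."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1] + 1)
    lp = log_posterior(theta, X, y)
    draws = np.empty((n_samples, theta.size))
    for i in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_posterior(prop, X, y)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        draws[i] = theta
    return draws

def impute_once(x, lags=(1, 2), seed=0):
    """One imputed copy of x; NaN marks a missing value."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    t = np.arange(len(x))
    filled = x.copy()
    # Placeholder initialisation (assumed here: linear interpolation in time).
    filled[missing] = np.interp(t[missing], t[~missing], x[~missing])
    idx, X = lagged_features(filled, lags)
    X = np.column_stack([np.ones(len(idx)), X])  # add an intercept column
    obs = ~missing[idx]
    draws = rwm(X[obs], filled[idx][obs], seed=seed)
    theta = draws[-1]  # a single late draw from the chain
    rng = np.random.default_rng(seed + 1)
    mis = ~obs
    # Posterior predictive draw: fitted mean plus noise at the sampled sigma.
    noise = np.exp(theta[-1]) * rng.standard_normal(mis.sum())
    filled[idx[mis]] = X[mis] @ theta[:-1] + noise
    # Missing values within max(lags) of either end keep their placeholder.
    return filled
```

Calling `impute_once` several times with different seeds would give several completed series, which is the sense in which the imputation is "multiple". A fuller version would loop over variables as MICE does and discard a burn-in portion of each chain before drawing.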