Beyond Random Missingness: Clinically Rethinking for Healthcare Time Series Imputation
Linglong Qian, Yiyuan Yang, Wenjie Du, Jun Wang, Richard Dobson, Zina Ibrahim
TL;DR
This paper argues that random masking inadequately reflects real-world clinical missingness in healthcare time series and systematically evaluates clinically informed masking strategies across eleven imputation methods on the PhysioNet 2012 dataset. It introduces two masking schemes, Augmentation (RMEO) and Overlay (RMOD), and compares pre-masking with mini-batch masking and two normalization timings (NBM vs NAM) to study their impact on imputation accuracy and downstream mortality prediction. The findings show that masking choices significantly influence both imputation metrics and clinical task performance: recurrent models such as BRITS are robust across strategies, while some architectures such as TimesNet may underperform in preserving predictive patterns. The work emphasizes the need for clinically grounded evaluation frameworks to ensure reliable deployment of imputation methods in healthcare settings, since high imputation accuracy does not always translate to better downstream decision support.
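To make the design choices named above concrete, the following is a minimal sketch on toy data, not the authors' implementation. It assumes that NBM means normalization before masking and NAM means normalization after masking, that pre-masking draws one fixed evaluation mask up front, and that mini-batch masking redraws the mask each time data is sampled. The helper names `zscore` and `random_mask`, the 10% masking rate, and the toy array are all hypothetical; the Augmentation and Overlay schemes are not sketched because their definitions are not given here.

```python
import numpy as np

def zscore(x, use):
    """Standardise each feature with statistics from the entries flagged in `use`."""
    vals = np.where(use, x, np.nan)
    mean = np.nanmean(vals, axis=0)
    std = np.nanstd(vals, axis=0)
    std = np.where(std > 0, std, 1.0)  # guard against constant features
    return (x - mean) / std

def random_mask(observed, rate, rng):
    """Hide a random fraction of observed entries; return what stays visible."""
    hide = observed & (rng.random(observed.shape) < rate)
    return observed & ~hide

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=(200, 3))      # toy series: (time steps, features)
observed = rng.random(x.shape) > 0.3         # toy "natural" missingness
x = np.where(observed, x, np.nan)

# Assumed NBM: statistics come from every observed value, including the
# values that are about to be hidden for evaluation.
x_nbm = zscore(x, observed)
visible_nbm = random_mask(observed, 0.1, np.random.default_rng(1))

# Assumed NAM: values are hidden first, so the statistics only see what
# the imputation model will also see.
visible_nam = random_mask(observed, 0.1, np.random.default_rng(1))
x_nam = zscore(x, visible_nam)

# Assumed pre-masking: one mask drawn once and reused for every epoch.
pre_mask = random_mask(observed, 0.1, np.random.default_rng(2))

# Assumed mini-batch masking: a fresh mask each time a batch is drawn.
batch_masks = [random_mask(observed, 0.1, rng) for _ in range(2)]
```

In this sketch the two normalization timings hide exactly the same entries yet produce different scaled values, because the hidden values contribute to the statistics in one case and not in the other. This is one way a seemingly minor pipeline choice can shift reported imputation error.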
Abstract
This study investigates the impact of masking strategies on time series imputation models in healthcare settings. While current approaches predominantly rely on random masking for model evaluation, this practice fails to capture the structured nature of missing patterns in clinical data. Using the PhysioNet Challenge 2012 dataset, we analyse how different masking implementations affect both imputation accuracy and downstream clinical predictions across eleven imputation methods. Our results demonstrate that masking choices significantly influence model performance, while recurrent architectures show more consistent performance across strategies. Analysis of downstream mortality prediction reveals that imputation accuracy doesn't necessarily translate to optimal clinical prediction capabilities. Our findings emphasise the need for clinically-informed masking strategies that better reflect real-world missing patterns in healthcare data, suggesting current evaluation frameworks may need reconsideration for reliable clinical deployment.
