Missing data imputation for noisy time-series data and applications in healthcare
Lien P. Le, Xuan-Hien Nguyen Thi, Thu Nguyen, Michael A. Riegler, Pål Halvorsen, Binh T. Nguyen
TL;DR
This study addresses missing data in healthcare time series by comparing MICE-RF with state-of-the-art deep-learning imputers (SAITS, BRITS, Transformer) across missing-data rates from $10\%$ to $80\%$. It evaluates not only imputation accuracy via MAE but also downstream classification performance (F1-score, AUC, MCC) to capture denoising effects. The results show that MICE-RF often yields the best MAE for univariate data at moderate missingness, while multivariate data without periodicity may benefit more from deep-learning approaches. Imputation also generally improves downstream classification, which suggests that it denoises the data in addition to filling gaps. The findings provide guidance on method selection based on data characteristics and highlight the practical impact of imputation on real-world healthcare analytics.
Abstract
Healthcare time series data is vital for monitoring patient activity but often contains noise and missing values caused by, for example, sensor errors or data interruptions. Imputation, i.e., filling in the missing values, is a common way to deal with this issue. In this study, we compare imputation methods for noisy time series with missing data, including Multiple Imputation with Random Forest (MICE-RF) and advanced deep learning approaches (SAITS, BRITS, Transformer), in terms of MAE, F1-score, AUC, and MCC across missing data rates from 10% to 80%. Our results show that MICE-RF imputes missing data effectively compared with the deep learning methods, and that classification improves on imputed data, which indicates that imputation can have a denoising effect. Applying an imputation algorithm to time series with missing data can therefore fill gaps and reduce noise at the same time.
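As an illustration of how imputation accuracy can be scored across missing-data rates, the following minimal Python sketch hides a fraction of a fully observed series, imputes the gaps, and computes MAE on the hidden positions only. It is not the pipeline used in the study: the synthetic series, the 10-point step between rates, the random masking scheme, and the `mean_impute` placeholder (standing in for MICE-RF, SAITS, BRITS, or Transformer) are all assumptions made for illustration.

```python
import random


def mask_at_random(series, pct, seed=0):
    """Hide pct percent of the values (set to None); return the masked copy and hidden indices."""
    rng = random.Random(seed)
    n_missing = len(series) * pct // 100
    hidden = sorted(rng.sample(range(len(series)), n_missing))
    masked = list(series)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def mean_impute(masked):
    """Placeholder imputer (assumption): fill every gap with the mean of the observed values."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def masked_mae(truth, imputed, hidden):
    """Mean absolute error over the hidden positions only."""
    return sum(abs(truth[i] - imputed[i]) for i in hidden) / len(hidden)


# Synthetic, fully observed series (assumption; stands in for real sensor data).
series = [float(i % 10) for i in range(200)]

# Sweep missing-data rates from 10% to 80%; the 10-point step is an assumption.
results = {}
for pct in range(10, 90, 10):
    masked, hidden = mask_at_random(series, pct)
    results[pct] = masked_mae(series, mean_impute(masked), hidden)

for pct, mae in results.items():
    print(f"{pct}% missing: MAE = {mae:.3f}")
```

Restricting the error to the hidden positions keeps the score from being diluted by values the imputer never had to estimate. Any other imputer can be swapped in for `mean_impute` as long as it returns a list of the same length with no gaps.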
