How Deep is your Guess? A Fresh Perspective on Deep Learning for Medical Time-Series Imputation

Linglong Qian, Tao Wang, Jun Wang, Hugh Logan Ellis, Robin Mitra, Richard Dobson, Zina Ibrahim

TL;DR

This paper analyzes deep imputers for EHR time-series, showing that the interplay of architectural and generative biases strongly shapes imputation performance and that larger models do not guarantee better results. It provides a theoretically grounded taxonomy of imputers, and an open, controlled benchmarking study using PyPOTS across eight models on PhysioNet 2012, revealing that design choices and alignment with EHR characteristics drive performance as much as, or more than, model size. The work highlights critical gaps in evaluation practices, especially masking strategies and uncertainty quantification, and identifies open questions at the interface of clinical domain knowledge and deep learning. The findings advocate for standardized, data-driven benchmarking and for integrating clinical insights to develop more reliable and clinically meaningful imputation methods for healthcare applications.

Abstract

We present a comprehensive analysis of deep learning approaches for Electronic Health Record (EHR) time-series imputation, examining how architectural and framework biases combine to influence model performance. Our investigation reveals varying capabilities of deep imputers in capturing complex spatiotemporal dependencies within EHRs, and that a model's effectiveness depends on how its combined biases align with medical time-series characteristics. Our experimental evaluation challenges common assumptions about model complexity, demonstrating that larger models do not necessarily improve performance. Rather, carefully designed architectures can better capture the complex patterns inherent in clinical data. The study highlights the need for imputation approaches that prioritise clinically meaningful data reconstruction over statistical accuracy. Our experiments show imputation performance variations of up to 20% based on preprocessing and implementation choices, emphasising the need for standardised benchmarking methodologies. Finally, we identify critical gaps between current deep imputation methods and medical requirements, highlighting the importance of integrating clinical insights to achieve more reliable imputation approaches for healthcare applications.

Paper Structure (28 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: A conceptual overview of generative frameworks used in medical time-series imputation.
  • Figure 2: The hierarchy of dimensions of inductive bias governing the behaviour of deep imputation models.
  • Figure 3: Masking techniques and approaches demonstrated over a time-series of five features ($x_1$ to $x_5$) and five time points ($t_1$ to $t_5$): (a) random masking, (b) temporal masking, (c) spatial masking, (d) block masking. Yellow cells are those labeled as missing via masking. In (e) augmentation and (f) overlaying, blue cells are those missing in the original data. In (e), the masked (yellow) cells do not overlap with the original missingness in the data. Green cells denote masked data coming from both the original missingness and the artificial missingness. In (f), overlaying masks cells from the original missingness or simulates artificial missingness on non-missing data.
  • Figure 4: Performance efficiency of the eight models.
  • Figure 5: The effect of different masking strategies on model performance, measured in MAE.
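
The masking techniques and approaches named in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the paper's or PyPOTS's implementation: the function names (`structured_mask`, `random_mask`, `masked_mae`), the time-by-feature array layout, the 2x2 block size, and the specific readings of "temporal" (one whole time point), "spatial" (one whole feature), and "block" (a contiguous time-feature patch) are all assumptions made for illustration. The only properties taken from the caption are that augmentation masks do not overlap with the original missingness, whereas overlaying may select cells that were already missing.

```python
import numpy as np

def structured_mask(shape, kind, rng):
    """Boolean mask (True = artificially hidden) over a (time, feature) grid."""
    T, D = shape
    mask = np.zeros(shape, dtype=bool)
    if kind == "temporal":    # assumed reading: hide every feature at one time point
        mask[rng.integers(T), :] = True
    elif kind == "spatial":   # assumed reading: hide one feature at every time point
        mask[:, rng.integers(D)] = True
    elif kind == "block":     # assumed reading: hide a contiguous 2x2 patch
        t0, d0 = rng.integers(T - 1), rng.integers(D - 1)
        mask[t0:t0 + 2, d0:d0 + 2] = True
    else:
        raise ValueError(f"unknown mask kind: {kind}")
    return mask

def random_mask(originally_missing, rate, rng, mode="augmentation"):
    """Random masking under the two approaches named in the caption."""
    if mode == "augmentation":   # sample only observed cells, so no overlap
        candidates = np.flatnonzero(~originally_missing)
    elif mode == "overlaying":   # sample from all cells, so overlap is possible
        candidates = np.arange(originally_missing.size)
    else:
        raise ValueError(f"unknown mode: {mode}")
    n = int(round(rate * candidates.size))
    chosen = rng.choice(candidates, size=n, replace=False)
    mask = np.zeros(originally_missing.shape, dtype=bool)
    mask.flat[chosen] = True
    return mask

def masked_mae(imputed, truth, mask, originally_missing):
    """MAE over cells that were artificially hidden and have a ground truth."""
    eval_cells = mask & ~originally_missing
    return float(np.abs(imputed - truth)[eval_cells].mean())
```

Under these assumptions, an augmentation mask always leaves the evaluation set at the requested size, while an overlaying mask can lose evaluation cells to positions that were already missing, which is one way the choice of approach could shift a reported MAE.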