The Stationarity Bias: Stratified Stress-Testing for Time-Series Imputation in Regulated Dynamical Systems

Amirreza Dolatpour Fathkouhi, Alireza Namazi, Heman Shakeri

TL;DR

This work formalizes the Stationarity Bias in time-series imputation benchmarks by showing how uniform random masking over predominantly stationary data inflates simple baselines' performance. It introduces a Stratified Stress-Test that partitions evaluation into Stationary and Transient regimes and demonstrates, across CGM datasets, that linear interpolation excels in stationary periods while deep-learning models are necessary to preserve morphological fidelity during critical transients. The study reveals the RMSE Mirage, where low pointwise error masks substantial shape distortion, and introduces distributional calibration analyses to expose miscalibration in imputed trajectories. It further proposes an adaptive inference framework that routes missing segments to the appropriate method based on gradient stability, balancing safety and efficiency for regulated dynamical systems. The findings have practical implications beyond CGM, generalizing to any system where routine stationarity dominates critical transient behavior, and advocate regime-stratified evaluation as a standard benchmarking practice.
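
The routing idea in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `route_gap` and `linear_fill`, the context window of 6 samples, and the gradient threshold of 1.0 (in signal units per sample) are all assumptions chosen for illustration. The sketch inspects the observed samples on either side of a gap and sends the gap to linear interpolation only when the local gradient is small.

```python
import numpy as np

def route_gap(series, gap_start, gap_end, context=6, grad_threshold=1.0):
    """Pick an imputation method for the gap series[gap_start:gap_end].

    `context` and `grad_threshold` are hypothetical values, not taken
    from the paper. Returns "linear" when the observed neighbourhood is
    stable and "deep" otherwise.
    """
    left = series[max(0, gap_start - context):gap_start]
    right = series[gap_end:gap_end + context]
    max_grad = 0.0
    seen_context = False
    for side in (left, right):
        side = side[~np.isnan(side)]
        if side.size >= 2:
            # Largest absolute first difference among observed neighbours.
            max_grad = max(max_grad, float(np.abs(np.diff(side)).max()))
            seen_context = True
    if not seen_context:
        # No usable context: fall back to the more conservative route.
        return "deep"
    return "linear" if max_grad <= grad_threshold else "deep"

def linear_fill(series):
    """Fill NaNs by linear interpolation between observed samples."""
    filled = series.astype(float).copy()
    idx = np.arange(filled.size)
    observed = ~np.isnan(filled)
    filled[~observed] = np.interp(idx[~observed], idx[observed], filled[observed])
    return filled

# Two toy traces with a two-sample gap at positions 4 and 5.
flat = np.array([100, 100.5, 100, 99.5, np.nan, np.nan, 100, 100.5, 100, 100])
rising = np.array([100, 110, 125, 145, np.nan, np.nan, 180, 170, 155, 140])

flat_route = route_gap(flat, 4, 6)
rising_route = route_gap(rising, 4, 6)
flat_filled = linear_fill(flat)
```

In this sketch the "deep" branch is only a label; a real pipeline would hand those segments to a trained model, and the threshold would need to be tuned on held-out data.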

Abstract

Time-series imputation benchmarks employ uniform random masking and shape-agnostic metrics (MSE, RMSE), implicitly weighting evaluation by regime prevalence. In systems with a dominant attractor (homeostatic physiology, nominal industrial operation, stable network traffic), this creates a systematic Stationarity Bias: simple methods appear superior because the benchmark predominantly samples the easy, low-entropy regime where they trivially succeed. We formalize this bias and propose a Stratified Stress-Test that partitions evaluation into Stationary and Transient regimes. Using Continuous Glucose Monitoring (CGM) as a testbed, chosen for its rigorous ground-truth forcing functions (meals, insulin) that enable precise regime identification, we establish three findings with broad implications: (i) Stationary Efficiency: Linear interpolation achieves state-of-the-art reconstruction during stable intervals, confirming that complex architectures are computationally wasteful in low-entropy regimes. (ii) Transient Fidelity: During critical transients (post-prandial peaks, hypoglycemic events), linear methods exhibit drastically degraded morphological fidelity (DTW), disproportionate to their RMSE, a phenomenon we term the RMSE Mirage, where low pointwise error masks the destruction of signal shape. (iii) Regime-Conditional Model Selection: Deep learning models preserve both pointwise accuracy and morphological integrity during transients, making them essential for safety-critical downstream tasks. We further derive empirical missingness distributions from clinical trials and impose them on complete training data, preventing models from exploiting unrealistically clean observations and encouraging robustness under real-world missingness. This framework generalizes to any regulated system where routine stationarity dominates critical transients.
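
To make the regime-stratified evaluation concrete, the following sketch scores linear interpolation on a synthetic trace, reporting RMSE overall and per regime, plus a plain dynamic-programming DTW distance over the transient window. Everything here is an illustrative assumption rather than the paper's protocol: the Gaussian "excursion" signal, the hand-placed gaps, the regime labels, and the helper names `rmse`, `dtw_distance`, and `stratified_rmse`.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def stratified_rmse(y_true, y_pred, masked, transient):
    """RMSE over masked points: pooled, then split by regime label."""
    scores = {"overall": rmse(y_true[masked], y_pred[masked])}
    for name, regime in (("stationary", ~transient), ("transient", transient)):
        sel = masked & regime
        scores[name] = rmse(y_true[sel], y_pred[sel]) if sel.any() else float("nan")
    return scores

# Synthetic trace: flat baseline with one excursion centred at t = 150.
t = np.arange(200)
truth = 100.0 + 80.0 * np.exp(-((t - 150) / 8.0) ** 2)
transient = (t >= 130) & (t < 170)   # hypothetical regime labels

# One gap in each regime, both 20 samples long.
masked = np.zeros(t.size, dtype=bool)
masked[40:60] = True     # stationary gap
masked[140:160] = True   # gap across the peak

# Linear interpolation baseline over the masked samples.
pred = truth.copy()
pred[masked] = np.interp(t[masked], t[~masked], truth[~masked])

scores = stratified_rmse(truth, pred, masked, transient)
transient_dtw = dtw_distance(truth[transient], pred[transient])
print(scores, transient_dtw)
```

Because the pooled score averages squared error over both gaps, it sits between the two per-regime values; the more stationary gaps a benchmark samples, the closer the pooled number drifts toward the stationary one.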

Paper Structure (45 sections, 9 equations, 15 figures, 3 tables, 1 algorithm)

Figures (15)

  • Figure 1: Conceptual illustration. (a) The Stationarity Bias: random masking disproportionately samples the dominant stationary regime, inflating baseline performance. (b) The RMSE Mirage: linear interpolation "cuts the corner" of a transient peak, achieving comparable RMSE to a deep model while destroying signal morphology (DTW).
  • Figure 2: Imputation performance during (a) homeostatic periods, (b) post-prandial excursions, and (c) hypoglycemia during temporal controller resets. In (a), linear interpolation is superior due to signal stability. In contrast, for (b) and (c), deep learning models demonstrate superior morphological fidelity.
  • Figure 3: Distributional calibration analysis. Each panel shows the conditional density of imputed values (shaded) against the ground-truth density (dashed black) during transient regimes. Vertical lines indicate distribution means.
  • Figure 4: Hourly probability of a missingness gap beginning during the day.
  • Figure 5: Empirical cumulative distribution function (CDF) of missingness durations. Comparison between day and night regimes.
  • ...and 10 more figures
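
The abstract's step of imposing empirical missingness on complete training data, which Figures 4 and 5 characterize through gap start times and gap durations, might be sketched as follows. This is a guess at the general shape of such a procedure, not the authors' algorithm: `sample_missingness_mask`, the per-step interpretation of the hourly start probability, the 5-minute sampling rate, and the example numbers are all hypothetical, and the sketch ignores the day/night split of durations shown in Figure 5.

```python
import numpy as np

def sample_missingness_mask(n_steps, steps_per_hour, start_prob_by_hour,
                            gap_durations, rng):
    """Return a boolean mask (True = missing) of length n_steps.

    start_prob_by_hour: 24 values, assumed here to be the per-step
        probability that a gap begins during that hour of the day.
    gap_durations: observed gap lengths in steps; resampling them with
        replacement draws from their empirical distribution.
    """
    mask = np.zeros(n_steps, dtype=bool)
    step = 0
    while step < n_steps:
        hour = (step // steps_per_hour) % 24
        if rng.random() < start_prob_by_hour[hour]:
            duration = int(rng.choice(gap_durations))
            mask[step:step + duration] = True
            step += max(duration, 1)   # resume after the gap ends
        else:
            step += 1
    return mask

rng = np.random.default_rng(0)
steps_per_hour = 12                      # assumes 5-minute sampling
n_steps = 3 * 24 * steps_per_hour        # three days of data

# Made-up inputs standing in for the empirical quantities.
start_prob = np.full(24, 0.005)
start_prob[0:6] = 0.01                   # hypothetical overnight increase
durations = np.array([1, 2, 2, 3, 6, 12, 24])

mask = sample_missingness_mask(n_steps, steps_per_hour, start_prob, durations, rng)

# Impose the mask on a complete (synthetic) series.
complete = 100.0 + rng.normal(0.0, 2.0, size=n_steps)
corrupted = np.where(mask, np.nan, complete)

# Edge cases: no gaps when the start probability is zero,
# fully missing when a gap always starts.
no_gaps = sample_missingness_mask(50, steps_per_hour, np.zeros(24), durations, rng)
all_gaps = sample_missingness_mask(50, steps_per_hour, np.ones(24), np.array([3]), rng)
```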