
RG-TTA: Regime-Guided Meta-Control for Test-Time Adaptation in Streaming Time Series

Indar Kumar, Akanksha Tiwari, Sai Krishna Jasti, Ankit Hemant Lade

Abstract

Test-time adaptation (TTA) enables neural forecasters to adapt to distribution shifts in streaming time series, but existing methods apply the same adaptation intensity regardless of the nature of the shift. We propose Regime-Guided Test-Time Adaptation (RG-TTA), a meta-controller that continuously modulates adaptation intensity based on distributional similarity to previously-seen regimes. Using an ensemble of Kolmogorov-Smirnov, Wasserstein-1, feature-distance, and variance-ratio metrics, RG-TTA computes a similarity score for each incoming batch and uses it to (i) smoothly scale the learning rate -- more aggressive for novel distributions, conservative for familiar ones -- and (ii) control gradient effort via loss-driven early stopping rather than fixed budgets, allowing the system to allocate exactly the effort each batch requires. As a supplementary mechanism, RG-TTA gates checkpoint reuse from a regime memory, loading stored specialist models only when they demonstrably outperform the current model (loss improvement >= 30%). RG-TTA is model-agnostic and strategy-composable: it wraps any forecaster exposing train/predict/save/load interfaces and enhances any gradient-based TTA method. We demonstrate three compositions -- RG-TTA, RG-EWC, and RG-DynaTTA -- and evaluate 6 update policies (3 baselines + 3 regime-guided variants) across 4 compact architectures (GRU, iTransformer, PatchTST, DLinear), 14 datasets (6 real-world multivariate benchmarks + 8 synthetic regime scenarios), and 4 forecast horizons (96, 192, 336, 720) under a streaming evaluation protocol with 3 random seeds (672 experiments total). Regime-guided policies achieve the lowest MSE in 156 of 224 seed-averaged experiments (69.6%), with RG-EWC winning 30.4% and RG-TTA winning 29.0%. Overall, RG-TTA reduces MSE by 5.7% vs TTA while running 5.5% faster; RG-EWC reduces MSE by 14.1% vs standalone EWC.
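The four-metric similarity ensemble and the learning-rate modulation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the equal ensemble weights, the distance normalisations, the `exp(-d)` distance-to-similarity mapping, and the LR range are all assumptions for the sketch, and the Wasserstein-1 estimate assumes equal-size samples.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (max gap between ECDFs)."""
    combined = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), combined, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), combined, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def wasserstein1(a, b):
    """Wasserstein-1 distance between equal-size 1-D samples (sorted coupling)."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def similarity(batch, regime, feat_batch, feat_regime,
               weights=(0.25, 0.25, 0.25, 0.25)):
    """Ensemble similarity in [0, 1]: high = familiar regime, low = novel."""
    d_ks = ks_statistic(batch, regime)                                   # in [0, 1]
    d_w1 = wasserstein1(batch, regime) / (np.std(regime) + 1e-8)         # scale-free
    d_ft = (np.linalg.norm(feat_batch - feat_regime)
            / (np.linalg.norm(feat_regime) + 1e-8))                      # feature distance
    d_vr = abs(np.log((np.var(batch) + 1e-8) / (np.var(regime) + 1e-8)))  # variance ratio
    dists = np.array([d_ks, d_w1, d_ft, d_vr])
    return float(np.dot(weights, np.exp(-dists)))  # distances -> similarities

def modulated_lr(sim, lr_min=1e-4, lr_max=1e-2):
    """Smooth LR scaling: aggressive for novel batches (low sim), conservative otherwise."""
    return lr_max - sim * (lr_max - lr_min)
```

A familiar batch (drawn from the same regime) should score higher similarity, and therefore receive a lower learning rate, than a shifted batch.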


Paper Structure

This paper contains 86 sections, 7 theorems, 20 equations, 8 figures, 13 tables, and 1 algorithm.

Key Result

Theorem 1

Assume the per-batch loss $\mathcal{L}(\phi) = \mathbb{E}_{B_t}[\ell(h_\phi(g_\psi(x)), y)]$ is $\mu$-strongly convex and $L$-smooth in $\phi$ with condition number $\kappa = L/\mu$. After $K$ gradient descent steps with learning rate $\alpha \in (0, 2/L)$ from initialisation $\phi_0$, the expected per-batch MSE decomposes as
$$\mathcal{L}(\phi_K) \;\le\; \sigma^2_t \;+\; \frac{L}{2}\,\rho^{2K}\,\|\phi_0 - \phi^*_t\|^2, \qquad \rho = \max\{|1-\alpha\mu|,\,|1-\alpha L|\} < 1,$$
where $\sigma^2_t = \mathbb{E}_{P_t}[\|y - h_{\phi^*_t}(g_\psi(x))\|^2]$ is the irreducible noise of regime $P_t$ and $\phi^*_t$ are the batch-optimal parameters (Definition 1).
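Theorem 1 combines $L$-smoothness with the standard geometric contraction of gradient descent on a $\mu$-strongly convex loss. That contraction ingredient can be checked numerically on a toy quadratic; the values of $\mu$, $L$, $\alpha$, and $\phi_0$ below are illustrative, not from the paper.

```python
import numpy as np

# Toy mu-strongly-convex, L-smooth loss: L(phi) = 0.5 * phi^T H phi,
# with eigenvalues mu = 0.5 and L = 2.0 (condition number kappa = 4).
H = np.diag([0.5, 2.0])
mu, Lip = 0.5, 2.0
alpha = 1.0 / Lip                                      # in (0, 2/L), so GD contracts
rho = max(abs(1 - alpha * mu), abs(1 - alpha * Lip))   # per-step contraction factor

phi = np.array([4.0, -3.0])   # phi_0; the optimum is phi* = 0
dist0 = np.linalg.norm(phi)
K = 25
for _ in range(K):
    phi = phi - alpha * (H @ phi)   # gradient step: grad L(phi) = H phi

# Distance to the optimum shrinks at least geometrically:
# ||phi_K - phi*|| <= rho^K * ||phi_0 - phi*||
assert np.linalg.norm(phi) <= rho ** K * dist0 + 1e-12
```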

Figures (8)

  • Figure 1: RG-TTA system overview. For each incoming batch $B_t$, the system extracts a distributional feature vector $\mathbf{r}$ (Eq. \ref{eq:features}) and computes a similarity score $\text{sim}$ against stored regimes using a four-metric ensemble (Eq. \ref{eq:ensemble}). If a high-similarity checkpoint passes the loss gate ($\ell_{\text{ckpt}} < 0.70\,\ell_{\text{curr}}$, Eq. \ref{eq:gate}), it replaces the current model. The learning rate is smoothly modulated by similarity (Eq. \ref{eq:lr}), and the output head is adapted with loss-driven early stopping (max 25 steps, patience 3). The adapted checkpoint and its regime features are stored in the memory $\mathcal{M}$ (dashed box; 5 slots, FIFO eviction) for future reuse. See Algorithm \ref{alg:rgtta} for pseudocode.
  • Figure 2: Pair-wise MSE change from adding regime-guidance. Negative values indicate improvement. All three pairs are statistically significant (Wilcoxon, Bonferroni-corrected).
  • Figure 3: Real-world MSE by model architecture and policy. RG-TTA and RG-EWC dominate on GRU-Small and iTransformer; DLinear benefits most from RG-DynaTTA.
  • Figure 4: Demšar-style critical difference diagram (Nemenyi, $\alpha=0.05$). Connected policies are not significantly different. RG-TTA and RG-EWC rank best (2.46, 2.51).
  • Figure 5: Per-dataset win rate (%) by policy. Real-world datasets (top) and synthetic (bottom) show similar patterns: RG-TTA and RG-EWC dominate most datasets.
  • ...and 3 more figures
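The two control rules in Figure 1's caption — the loss gate on checkpoint reuse and loss-driven early stopping (max 25 steps, patience 3) — can be sketched in isolation. The helper names below are illustrative, not the paper's API; only the thresholds (0.70 gate, 25 steps, patience 3) come from the caption.

```python
def adapt_with_early_stopping(step_fn, max_steps=25, patience=3):
    """Run adaptation steps, stopping after `patience` steps without loss improvement.

    `step_fn` performs one gradient step on the current batch and returns its loss.
    Returns (best loss seen, number of steps actually taken).
    """
    best_loss, stale, steps = float("inf"), 0, 0
    for _ in range(max_steps):
        loss = step_fn()
        steps += 1
        if loss < best_loss - 1e-12:
            best_loss, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_loss, steps

def should_load_checkpoint(loss_ckpt, loss_curr, gate=0.70):
    """Loss gate: reuse a stored specialist only on a clear (>= 30%) loss win."""
    return loss_ckpt < gate * loss_curr
```

With a plateauing loss sequence the loop stops well short of the 25-step budget, which is the mechanism behind the "allocate exactly the effort each batch requires" claim in the abstract.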

Theorems & Definitions (8)

  • Definition 1: Batch-optimal parameters
  • Theorem 1: Adaptation error bound
  • Corollary 1: Step savings from checkpoint reuse
  • Theorem 2: Generalisation bound for frozen-backbone adaptation
  • Proposition 1: Metric properties
  • Proposition 2: Sufficient condition for beneficial checkpoint loading
  • Proposition 3: Specialist advantage
  • Proposition 4: Convergence advantage under regime reuse