Model Assessment and Selection under Temporal Distribution Shift

Elise Han; Chengpiao Huang; Kaizheng Wang

TL;DR

The paper studies how to assess and select predictors under temporal distribution shift. It introduces an adaptive rolling-window estimator of the current generalization error $L_t(f)$ and a framework for pairwise model comparison. It then extends to multi-model selection via a single-elimination tournament, with oracle-type guarantees that adapt to unknown nonstationarity patterns. Theoretical analyses and experiments on synthetic and real data demonstrate the method's adaptivity: it performs comparably to large fixed windows in stationary settings and outperforms small-window baselines during shifts. The result is a practical offline toolkit for model evaluation and selection in evolving environments, using historical data from past epochs.
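To make the idea of an adaptive rolling window concrete, here is a minimal sketch of a Lepski-style window choice. It is not taken from the paper: the function names, the Bernstein-style width `psi`, its constants, and the assumption that losses lie in `[0, loss_range]` are all illustrative. For each look-back window `k` it averages the losses of one fixed model over the `k` most recent periods, then picks the window that balances a noise width against a bias proxy (disagreement with shorter windows beyond what noise would explain).

```python
import math
from statistics import mean, pvariance


def rolling_window_error(losses_by_period, k):
    """Pool the losses from the k most recent periods; return mean, variance, count."""
    recent = [x for period in losses_by_period[-k:] for x in period]
    return mean(recent), pvariance(recent), len(recent)


def adaptive_window_error(losses_by_period, delta=0.1, loss_range=1.0):
    """Estimate the current error with a data-driven look-back window.

    losses_by_period[j] holds observed losses of one fixed model in period j,
    oldest first; the last entry is the current period.
    Returns (error estimate, chosen window size).
    """
    t = len(losses_by_period)
    log_term = math.log(2.0 / delta)

    stats = {}
    for k in range(1, t + 1):
        mu, var, n = rolling_window_error(losses_by_period, k)
        # Noise width in the spirit of a Bernstein bound (illustrative constants).
        psi = math.sqrt(2.0 * var * log_term / n) + 2.0 * loss_range * log_term / (3.0 * n)
        stats[k] = (mu, psi)

    best_k, best_score = 1, float("inf")
    for k in range(1, t + 1):
        mu_k, psi_k = stats[k]
        # Bias proxy: largest gap to any shorter window that exceeds the noise widths.
        phi = max(
            max(abs(mu_k - stats[i][0]) - (psi_k + stats[i][1]), 0.0)
            for i in range(1, k + 1)
        )
        score = phi + psi_k
        if score < best_score:  # ties go to the shorter window
            best_k, best_score = k, score
    return stats[best_k][0], best_k
```

When all periods look alike, the bias proxy stays at zero and the longest window has the smallest width, so it is selected. After an abrupt change in the latest period, longer windows disagree sharply with the shortest one and the selection falls back to recent data.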

Abstract

We investigate model assessment and selection in a changing environment, by synthesizing datasets from both the current time period and historical epochs. To tackle unknown and potentially arbitrary temporal distribution shift, we develop an adaptive rolling window approach to estimate the generalization error of a given model. This strategy also facilitates the comparison between any two candidate models by estimating the difference of their generalization errors. We further integrate pairwise comparisons into a single-elimination tournament, achieving near-optimal model selection from a collection of candidates. Theoretical analyses and numerical experiments demonstrate the adaptivity of our proposed methods to the non-stationarity in data.
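The tournament step mentioned in the abstract can be illustrated with a short sketch. The pairing scheme, the handling of byes, and the `better` callback are assumptions made for illustration, not the paper's algorithm. In practice `better(a, b)` would wrap a pairwise comparison built from the estimated difference of the two models' generalization errors.

```python
import random
from typing import Callable, List, Optional, TypeVar

M = TypeVar("M")


def single_elimination(
    candidates: List[M],
    better: Callable[[M, M], M],
    rng: Optional[random.Random] = None,
) -> M:
    """Return the winner of a single-elimination tournament.

    `better(a, b)` is a hypothetical pairwise comparison that returns
    whichever of the two models it judges to have the smaller current error.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    if rng is not None:
        rng.shuffle(pool)  # optional random pairing
    while len(pool) > 1:
        # Pair up neighbours; each pair sends one winner to the next round.
        next_round = [better(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # the unpaired model gets a bye
        pool = next_round
    return pool[0]
```

A tournament over $m$ candidates needs only $m - 1$ pairwise comparisons, compared with $m(m-1)/2$ for a full round robin.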

Paper Structure (29 sections, 12 theorems, 76 equations, 6 figures, 4 tables, 4 algorithms)

Introduction
Main contributions.
Related works.
Outline.
Notation.
Problem Setup
Model Assessment
Model Selection
Warmup: Model Comparison
Selection from Multiple Candidates
Numerical Experiments
Synthetic Data
Real Data: Topic Frequency Estimation
Real Data: House Price Prediction
Summary of Experiments
...and 14 more sections

Key Result

Lemma 3.1

Let $\{ x_i \}_{i=1}^n$ be independent random variables taking values in $[a, b]$ almost surely. Define the average variance $\sigma^2 = \frac{1}{n} \sum_{i=1}^n \operatorname{var}(x_i)$. For any $\delta \in (0, 1)$, with probability at least $1-\delta$,
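
The displayed inequality is missing from this extract. For orientation only, the classical Bernstein inequality under the same assumptions (using $|x_i - \mathbb{E} x_i| \le b - a$) yields a bound of the following form. The constants in the paper's Lemma 3.1 may differ.

$$
\left| \frac{1}{n} \sum_{i=1}^n \left( x_i - \mathbb{E} x_i \right) \right|
\le \sigma \sqrt{\frac{2 \log(2/\delta)}{n}} + \frac{2 (b - a) \log(2/\delta)}{3n} .
$$

The first term is the usual variance-driven rate, and the second is a lower-order correction that depends only on the range $b - a$.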

Figures (6)

Figure 1: True means $\{ \mu_t \}_{t=0}^{100}$ in the synthetic data.
Figure 2: Excess risks of different model selection methods in \ref{eg-syn-1}. Left: $\sigma^2 = 1$. Right: $\sigma^2 = 10$. Red: $\mathcal{V}_{\rm ARW}$. Orange: $\mathcal{V}_1$. Blue: $\mathcal{V}_{256}$.
Figure 3: Excess risks of different model selection methods in \ref{eg-syn-2}. Left: $\sigma^2 = 1$. Right: $\sigma^2 = 10$. Red: $\mathcal{V}_{\rm ARW}$. Orange: $\mathcal{V}_1$. Blue: $\mathcal{V}_{256}$.
Figure 4: Error curves of different model selection methods on the arXiv data. Red: $\mathcal{V}_{\rm ARW}$. Orange: $\mathcal{V}_1$. Blue: $\mathcal{V}_{256}$.
Figure 5: Error curves of different model selection methods on the housing data. Red: $\mathcal{V}_{\rm ARW}$. Orange: $\mathcal{V}_1$. Blue: $\mathcal{V}_{256}$.
...and 1 more figure

Theorems & Definitions (20)

Lemma 3.1: Bernstein bound
Corollary 3.1
Lemma 3.2
Corollary 3.2
Lemma 3.3
Theorem 3.1: Oracle inequality
Example 3.1: Change point
Example 3.2: Bounded drift
Lemma 3.4
proof
...and 10 more
