Minimax Optimal Estimation of Stability Under Distribution Shift

Hongseok Namkoong, Yuanzhe Ma, Peter W. Glynn

TL;DR

The authors study an estimator built on the dual formulation of the stability measure and show that it is minimax optimal. Empirically, the stability measure reliably captures system performance under distribution shift in applications including queueing systems and healthcare prediction tasks.

Abstract

The performance of decision policies and prediction models often deteriorates when applied to environments different from the ones seen during training. To ensure reliable operation, we analyze the stability of a system under distribution shift, which is defined as the smallest change in the underlying environment that causes the system's performance to deteriorate beyond a permissible threshold. In contrast to standard tail risk measures and distributionally robust losses that require the specification of a plausible magnitude of distribution shift, the stability measure is defined in terms of a more intuitive quantity: the level of acceptable performance degradation. We develop a minimax optimal estimator of stability and analyze its convergence rate, which exhibits a fundamental phase shift behavior. Our characterization of the minimax convergence rate shows that evaluating stability against large performance degradation incurs a statistical cost. Empirically, we demonstrate the practical utility of our stability framework by using it to compare system designs on problems where robustness to distribution shift is critical.

Paper Structure (40 sections, 23 theorems, 133 equations, 21 figures, 4 tables, 2 algorithms)

This paper contains 40 sections, 23 theorems, 133 equations, 21 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

Let $R \sim P$ be a real-valued random variable. For every $y$ with $\mathbb{E}_P[R] \le y < \operatorname{ess\,sup} R$,

$$\inf_{Q} \left\{ D_{\mathrm{kl}}(Q \,\|\, P) : \mathbb{E}_Q[R] \ge y \right\} = \sup_{\lambda \in \mathbb{R}} \left\{ \lambda y - \log \mathbb{E}_P[e^{\lambda R}] \right\}.$$

Furthermore, if the supremum on the right-hand side is attained at $\lambda^\star$, then $\lambda^\star \ge 0$ and the distribution $Q^\star$ that achieves the infimum on the left-hand side satisfies $dQ^\star(x) = \frac{e^{\lambda^\star x}\, dP(x)}{\mathbb{E}_P[e^{\lambda^\star R}]}$.
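To make the dual form concrete, here is a minimal sketch of a plug-in estimator. It assumes the dual objective is $\lambda y - \log \mathbb{E}_P[e^{\lambda R}]$ and replaces the expectation with a sample average. The paper's exact construction of $\widehat{I}_n$ may differ. The function names, the search cap `lam_max`, and the iteration count are hypothetical choices made for illustration. As a sanity check, the sketch uses an exponential distribution with mean $\sigma$ (an assumed parameterization), for which maximizing the dual by hand gives $\lambda^\star = 1/\sigma - 1/y$ and the value $y/\sigma - 1 - \log(y/\sigma)$ when $y \ge \sigma$.

```python
import math
import random


def empirical_dual(lam, samples, y):
    """lam * y - log((1/n) * sum_i exp(lam * r_i)), computed stably."""
    shift = max(lam * r for r in samples)  # log-sum-exp shift to avoid overflow
    total = sum(math.exp(lam * r - shift) for r in samples)
    log_mgf = shift + math.log(total / len(samples))
    return lam * y - log_mgf


def plug_in_stability(samples, y, lam_max=50.0, iters=60):
    """Maximize the concave empirical dual over [0, lam_max] by ternary search.

    lam_max and iters are illustrative choices, not values from the paper.
    """
    mean, top = sum(samples) / len(samples), max(samples)
    if not (mean <= y < top):
        raise ValueError("need sample mean <= y < sample maximum")
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if empirical_dual(m1, samples, y) < empirical_dual(m2, samples, y):
            lo = m1  # the maximizer lies to the right of m1
        else:
            hi = m2  # the maximizer lies to the left of m2
    lam_hat = 0.5 * (lo + hi)
    return empirical_dual(lam_hat, samples, y), lam_hat


def exponential_stability(y, sigma):
    """Closed-form dual value for an exponential with mean sigma, y >= sigma."""
    return y / sigma - 1.0 - math.log(y / sigma)


# Usage: exponential samples with mean sigma = 1 and threshold y = 2.
rng = random.Random(0)
sigma, y = 1.0, 2.0
samples = [rng.expovariate(1.0 / sigma) for _ in range(20000)]
estimate, lam_hat = plug_in_stability(samples, y)
print(estimate, lam_hat, exponential_stability(y, sigma))
```

Ternary search is used only because the empirical objective is concave in $\lambda$; any one-dimensional concave maximizer would serve. Restricting the search to $\lambda \ge 0$ relies on the lemma's conclusion that $\lambda^\star \ge 0$.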

Figures (21)

  • Figure 1: Models with initially near-identical average-case performance exhibit different accuracy over time.
  • Figure 2: Mean squared error of $\widehat{I}_n$ (over 40 runs) with $P = \mathsf{Exp}(\sigma), \sigma \in \{0.9,1,1.1\}$ and a fixed threshold $y=2$.
  • Figure 3: Mean squared error of $\widehat{I}_n$ and $\Tilde{I}_n$ (over 40 runs) with $P = \mathsf{Exp}(\sigma), \sigma \in \{0.9,1,1.1\}$ and a fixed threshold $y=2$. For $\Tilde{I}_n$, we use a Gaussian kernel with bandwidth $h=0.1$.
  • Figure 4: Performance of policies on i.i.d. test data
  • Figure 5: Mean squared error of $\widehat{I}_n$ (over 40 runs) for different policies with threshold $y = 3720.66$. For the KDE estimator $\Tilde{I}_n$, we use a Gaussian kernel with bandwidth $h = 100$.
  • ...and 16 more figures

Theorems & Definitions (27)

  • Lemma 1 (Donsker and Varadhan, 1976)
  • Theorem 1
  • Definition 1
  • Theorem 2
  • Theorem 3
  • Definition 2
  • Theorem 4
  • Corollary 1
  • Definition 3
  • Theorem 5
  • ...and 17 more