Testing For Distribution Shifts with Conditional Conformal Test Martingales

Shalev Shaer; Yarin Bar; Drew Prinster; Yaniv Romano

TL;DR

This work proposes a sequential test for detecting arbitrary distribution shifts that allows conformal test martingales to work under a fixed, reference-conditional setting, and produces a robust martingale construction that remains valid conditional on the null reference data.

Abstract

We propose a sequential test for detecting arbitrary distribution shifts that allows conformal test martingales (CTMs) to work under a fixed, reference-conditional setting. Existing CTM detectors construct test martingales by continually growing a reference set with each incoming sample, using it to assess how atypical the new sample is relative to past observations. While this design yields anytime-valid type-I error control, it suffers from test-time contamination: after a change, post-shift observations enter the reference set and dilute the evidence for distribution shift, increasing detection delay and reducing power. In contrast, our method avoids contamination by design by comparing each new sample to a fixed null reference dataset. Our main technical contribution is a robust martingale construction that remains valid conditional on the null reference data, achieved by explicitly accounting for the estimation error in the reference distribution induced by the finite reference set. This yields anytime-valid type-I error control together with guarantees of asymptotic power one and bounded expected detection delay. Empirically, our method detects shifts faster than standard CTMs, providing a powerful and reliable distribution-shift detector.
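The fixed-reference idea can be illustrated with a minimal sketch, under several assumptions that are not from the paper: scores are one-dimensional with large values counted as atypical, the betting function is the simple linear bet `1 + eta * (1 - 2p)` with a fixed `eta` (the paper learns $\eta_t$ online), and the p-values are naive plug-in values against the reference set. The sketch does not include the paper's correction for the estimation error of the finite reference set, so it should not be read as the robust construction itself. All function names are hypothetical.

```python
import bisect
import random


def fixed_reference_pvalue(sorted_ref, score, u):
    """Smoothed conformal-style p-value of `score` against a fixed, sorted reference set.

    `u` is a Uniform(0, 1) draw used to break ties at random.
    """
    n = len(sorted_ref)
    lo = bisect.bisect_left(sorted_ref, score)
    hi = bisect.bisect_right(sorted_ref, score)
    greater = n - hi   # reference scores strictly larger than `score`
    ties = hi - lo     # reference scores equal to `score`
    return (greater + u * (ties + 1)) / (n + 1)


def linear_bet(p, eta):
    """Illustrative betting function: nonnegative and integrates to 1 on [0, 1] for |eta| <= 1."""
    return 1.0 + eta * (1.0 - 2.0 * p)


def first_rejection_time(scores, sorted_ref, alpha=0.05, eta=0.5, seed=0):
    """Return the first time the wealth process reaches 1 / alpha, or None if it never does."""
    rng = random.Random(seed)
    wealth = 1.0
    for t, s in enumerate(scores, start=1):
        p = fixed_reference_pvalue(sorted_ref, s, rng.random())
        wealth *= linear_bet(p, eta)
        if wealth >= 1.0 / alpha:  # threshold motivated by Ville's inequality
            return t
    return None


# Assumed toy setting: null N(0, 1) reference, test stream shifted to N(1, 1).
rng = random.Random(1)
reference = sorted(rng.gauss(0.0, 1.0) for _ in range(2000))
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]
detection_time = first_rejection_time(shifted, reference)
print("first rejection time:", detection_time)
```

Because the reference set never changes, post-shift samples cannot dilute it. The price is that the plug-in p-values above are not exactly uniform conditional on the reference data, which is the gap the paper's robust martingale construction is designed to close.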

Paper Structure (27 sections, 8 theorems, 79 equations, 6 figures, 2 algorithms)

This paper contains 27 sections, 8 theorems, 79 equations, 6 figures, 2 algorithms.

Introduction
Background and Related Work
Testing by Betting with Test Martingales
Conformal Test Martingales
Additional Related Work
Proposed Method
Forming the Test
Online Learning of the Betting Parameter $\eta_t$
Power Analysis
Practical Consideration: Mitigating Wealth Decay
Experiments
Synthetic Experiments
Experiments on ImageNet-C
Discussion
Experimental Details for Figure \ref{fig:synth_pval_contamination}
...and 12 more sections

Key Result

Theorem 3.1

Let $\mathcal{F}_{t-1} := \sigma(\hat{p}_1,...,\hat{p}_{t-1})$ be the filtration generated by $\hat{p}_1,...,\hat{p}_{t-1}$. Given an ECDF $\hat{F}_0$ of the null distribution $P$, estimated with the reference set $D_0$, and corresponding confidence bounds $\epsilon(\cdot)$ that satisfy eq:CI for a

Figures (6)

Figure 1: Running mean of conformal $p$-values over time. Comparison of conformal $p$-values (mean over 10 values) under a change-point introduced at timestep $300$. A $p$-value near $0.5$ indicates inability to detect distributional change; thus, the decay of $p$-values toward $0.5$ reflects increasing contamination of $D_t$ over time. Exact details are provided in Appendix \ref{appdx:synth_pval}.
Figure 2: Type-I error control under finite reference samples. (left): Comparison of the Type-I error rate, evaluated over 100 repetitions, between an invalid CTM (purple) and our proposed method (blue) for varying reference set sizes $n$. The dashed line represents the test level $\alpha=0.05$. Power under an alternative. (right): Cumulative power over time, evaluated over 100 repetitions, for detecting an immediate shift (at time step $0$) from $\mathcal{N}(0,1)$ to $\mathcal{N}(1,1)$, with reference size $n=2000$. Our method (blue) achieves higher power and faster detection compared to the CTM (red). Exact details are provided in Appendix \ref{appdx:fig_1}.
Figure 3: Empirical power of the standard CTM and our conditional CTM in three scenarios. In each scenario, the empirical power is evaluated over 100 trials. Left: an immediate change-point case with different shift magnitudes. Middle: a delayed change-point scenario with different shift delays. Right: gradual shifts of different rates.
Figure 4: Empirical power across reference-set sizes on ImageNet-C. Left: power, evaluated on all 15 corruptions, as a function of the time steps. Right: median of the ratios of rejection times across different corruption groups as a function of $|D_0|=n$. Values above $1$ indicate faster detection by conditional CTM. Shaded regions denote standard error on 10 realizations.
Figure 5: Ablation study on the effect of the clipping parameter $C$. Empirical power, evaluated over 100 repetitions, of our proposed method for different values of the clipping parameter $C$. Left: immediate change-point scenario. Right: a delayed change-point scenario, with a delay of $200$ samples. The results are presented after the delay occurs.
...and 1 more figures
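
The contamination effect described in the Figure 1 caption can be reproduced in spirit with a small simulation. This is a sketch under assumed settings, not the experiment from the appendix: the data are Gaussian with a mean shift from $0$ to $1$ (borrowed from the Figure 2 caption), the raw observation is used as the score, and the fixed reference set is taken to be the 300 pre-change samples. The helper names are hypothetical.

```python
import bisect
import random


def smoothed_pvalue(sorted_ref, score, u):
    """Smoothed conformal-style p-value of `score` against a sorted reference list."""
    n = len(sorted_ref)
    lo = bisect.bisect_left(sorted_ref, score)
    hi = bisect.bisect_right(sorted_ref, score)
    return ((n - hi) + u * (hi - lo + 1)) / (n + 1)


def pvalue_streams(n_ref=300, n_post=2000, shift=1.0, seed=0):
    """Post-change p-values against a fixed reference set and against a growing one."""
    rng = random.Random(seed)
    fixed_ref = sorted(rng.gauss(0.0, 1.0) for _ in range(n_ref))
    growing_ref = list(fixed_ref)  # starts identical, then absorbs post-change samples
    fixed_p, growing_p = [], []
    for _ in range(n_post):
        x = rng.gauss(shift, 1.0)
        u = rng.random()
        fixed_p.append(smoothed_pvalue(fixed_ref, x, u))
        growing_p.append(smoothed_pvalue(growing_ref, x, u))
        bisect.insort(growing_ref, x)  # contamination: the shifted sample joins the reference
    return fixed_p, growing_p


def mean(values):
    return sum(values) / len(values)


fixed_p, growing_p = pvalue_streams()
late_fixed = mean(fixed_p[-200:])
late_growing = mean(growing_p[-200:])
print("late mean p-value, fixed reference:  ", round(late_fixed, 3))
print("late mean p-value, growing reference:", round(late_growing, 3))
```

In this toy setting the growing-reference p-values drift back toward $0.5$ as shifted samples accumulate in the reference set, while the fixed-reference p-values stay small, which is the qualitative pattern the caption describes.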

Theorems & Definitions (15)

Theorem 3.1
Lemma 3.2
Theorem 3.3
Lemma 4.1: Adapted from dai2025individual, chen2025online
Lemma 4.2
proof
Lemma 5.1
proof
proof
Lemma 5.2: hazan2016introduction
...and 5 more
