Visual tests using several safe confidence intervals

Timothée Mathieu

TL;DR

This paper develops a principled visual framework for two-sample mean comparison by constructing confidence intervals from e-variables and testing overlap between the intervals. It provides both fixed-time and anytime (sequential) tests with nonparametric, finite-sample type I–III error guarantees under bounded-support assumptions, leveraging the betting/e-value framework and Ville-type martingale bounds. The key contributions include (i) a concrete construction of $C_n(\alpha;X,W)$ via $E_n$, (ii) fixed-time and sequential overlap tests with explicit weight calibration and error bounds, (iii) analysis of interval length and non-intersection probabilities, and (iv) practical demonstrations in simulated data and in comparing sequential learning algorithms. The results offer a safe, interpretable visual alternative for practitioners to assess whether two population means differ, with rigorous nonparametric guarantees and applicability to sequential data streams in ML contexts.

Abstract

We propose a new statistical hypothesis testing framework that decides visually, using confidence intervals, whether the means of two samples are equal or whether one is larger than the other. With our method, the user can simultaneously visualize the confidence regions of the means and test whether the means of the two populations differ significantly by checking whether the two confidence intervals overlap. To design this test, we use confidence intervals constructed from e-variables, which provide a measure of evidence in hypothesis testing. We propose both a sequential and a non-sequential test based on the overlap of confidence intervals, and for each test we give finite-time bounds on the probabilities of error. We also illustrate the practicality of our method by applying it to the comparison of sequential learning algorithms.
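
The overlap idea described in the abstract can be illustrated with a minimal fixed-time sketch. It is not the paper's exact construction of $C_n(\alpha;X,W)$: it assumes independent observations in a known range $[a, b]$, uses a constant betting weight `w`, and builds the interval on a grid from the two product e-variables $\prod_i (1 \pm w(X_i - m))$ with a Markov threshold of $2/\alpha$. The function names `betting_ci` and `overlap_decision`, the default weight, and the grid size are all hypothetical choices made for illustration.

```python
import math


def betting_ci(xs, alpha, a=0.0, b=1.0, w=None, grid=1001):
    """Grid-based confidence interval for the mean of data lying in [a, b].

    Assumption (illustrative, not taken from the paper): a constant weight w
    with 0 < w < 1 / (b - a), so that every factor 1 +/- w * (x - m) is positive.
    A candidate mean m is kept while both e-variables stay below 2 / alpha.
    """
    if w is None:
        w = 0.5 / (b - a)
    if not 0 < w < 1.0 / (b - a):
        raise ValueError("w must lie in (0, 1 / (b - a))")
    log_threshold = math.log(2.0 / alpha)
    kept = []
    for k in range(grid):
        m = a + (b - a) * k / (grid - 1)
        # Log of the e-variable betting that the true mean is above m.
        log_plus = sum(math.log1p(w * (x - m)) for x in xs)
        # Log of the e-variable betting that the true mean is below m.
        log_minus = sum(math.log1p(-w * (x - m)) for x in xs)
        if max(log_plus, log_minus) < log_threshold:
            kept.append(m)
    if not kept:
        raise ValueError("no grid point retained; use a finer grid")
    return min(kept), max(kept)


def overlap_decision(ci_x, ci_y):
    """Compare two intervals and return a decision string."""
    if ci_x[1] < ci_y[0]:
        return "mean_x < mean_y"
    if ci_y[1] < ci_x[0]:
        return "mean_x > mean_y"
    return "no decision"
```

Calling `overlap_decision(betting_ci(xs, alpha), betting_ci(ys, alpha))` gives the visual test: the two intervals can be drawn side by side, and a decision is made only when they are disjoint. In this sketch, a union bound shows that the probability of wrongly declaring a difference when the means are equal is at most $2\alpha$; the paper's weight calibration and error bounds are more refined than this.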

Paper Structure

This paper contains 31 sections, 9 theorems, 72 equations, 11 figures.

Key Result

Lemma 4

Let $w>0$ and $n \in \mathbb{N}$, suppose that for all $1\le t\le n$, we have $W_t(X)=w$, and define $v = w^2/(1-w(b_P-a_P))$. Suppose that where $\widehat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \frac{1}{n}\sum_{i=1}^nX_i)^2$ is the empirical variance. Then the length $L(C_n(\alpha;X,W))$ of the confidence interval $C_n(\alpha;X,W)$ satisfies

Figures (11)

  • Figure 1: Approximate probabilities of error of the overlap test for $n$ sufficiently large.
  • Figure 2: Probabilities of obtaining each decision and mean sample size at decision (with std in parentheses) for the Anytime overlap test. Theoretical bounds on errors are: $\mathrm{type\ I} \le 0.052$, $\mathrm{type\ II}\le 0.2$, $\mathrm{type\ III}\le 0.014$
  • Figure 3: Probabilities of obtaining each decision. Fixed time, with 1000 samples each. Theoretical bounds on errors are: $\mathrm{type\ I} \le 0.04$, $\mathrm{type\ II}\le 1$, $\mathrm{type\ III}\le 0.01$
  • Figure 4: Confidence intervals for the comparison of Deep RL (Figure \ref{fig:ale}) and Bandit (Figure \ref{fig:bandits}) algorithms. The stars represent the empirical means and the boxes are the confidence intervals.
  • Figure 5: Event 1
  • ...and 6 more figures

Theorems & Definitions (14)

  • Remark 1: Notations
  • Remark 2: Burn-in period
  • Remark 3: directional versus bilateral test
  • Lemma 4: Length of confidence interval -- constant weights
  • Remark 5: Hoeffding-type weights
  • Remark 6: Tuning the constant $c$
  • Lemma 7
  • Theorem 8: Type I errors
  • Theorem 9: Anytime bound on type II error
  • Theorem 10: Fixed time bound on type II error
  • ...and 4 more