How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?

Odalric-Ambrym Maillard; Mohammad Sadegh Talebi

TL;DR

This work studies a set of $K$ discrete distributions $(p_k)_{k\in\mathcal K}$ over a common alphabet that are permutation-equivalent: each $p_k$ is obtained from an unknown common distribution $q$ by an unknown permutation of the alphabet. The aim is to tighten the individual confidence sets by exploiting this structure. The paper introduces a low-complexity algorithm that identifies compatible matchings among distributions and constructs refined confidence intervals that remain valid while using all samples, supported by finite-time high-probability bounds. The analysis shows that the refined confidence sets shrink at rates $O\big(1/\sqrt{\sum_k n_k}\big)$ for elements inside the support of $q$ and $O\big(1/\max_k n_k\big)$ for elements outside it, improving over per-distribution estimation, with the largest gains when $K$ is large. The approach is instantiated with surrogate confidence intervals (e.g., KL, Bernstein, empirical Bernstein) and demonstrated on a reinforcement learning task (RiverSwim), where exploiting permutation-equivalence yields tighter estimates and lower regret than the baselines. The results indicate when refinement pays off, in terms of data and structure, and how to balance statistical gains against computational cost, and the paper outlines extensions to broader automorphism families and applications.
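The matching-and-refinement idea can be sketched for $K=2$ as follows. This is a simplified illustration, not the paper's algorithm: the function names (`intersects`, `refine`), the interval values, and the rule used when several candidates remain (intersecting with the hull of the candidates) are all assumptions made for this sketch. It also ignores the constraint that the matching must be a bijection.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound) on one probability


def intersects(a: Interval, b: Interval) -> bool:
    """True when the two closed intervals share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def refine(own: List[Interval], other: List[Interval]) -> List[Interval]:
    """Shrink each interval in `own` using intervals of an equivalent distribution.

    Hypothetical rule for this sketch: if both families of intervals are valid,
    the true match of each element lies among the intervals of `other` that
    intersect it, so the true value lies in the own interval intersected with
    the hull of those candidates.
    """
    refined = []
    for a in own:
        candidates = [b for b in other if intersects(a, b)]
        if not candidates:
            # Cannot occur when all intervals are valid; keep the original.
            refined.append(a)
            continue
        lo = max(a[0], min(b[0] for b in candidates))
        hi = min(a[1], max(b[1] for b in candidates))
        refined.append((lo, hi))
    return refined


# Illustrative values only: wide intervals around p = (0.6, 0.3, 0.1) and
# tight intervals around a permuted copy (0.1, 0.6, 0.3).
own = [(0.45, 0.75), (0.16, 0.44), (0.0, 0.25)]
other = [(0.05, 0.15), (0.57, 0.63), (0.27, 0.33)]
refined = refine(own, other)
print(refined)
```

In this toy instance every element has a single compatible candidate, so each wide interval collapses to the tight interval of its match. With fewer observations the candidate sets overlap and the refinement is weaker, in line with the abstract's remark that refinement is not possible in general when there are too few observations.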

Abstract

We consider the situation where a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set from $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common, but unknown, distribution $q$ by applying an unknown permutation to $\mathcal X$. We call this permutation-equivalence. The goal is to build refined confidence sets that exploit this structural property. Like other popular notions of structure (Lipschitz smoothness, linearity, etc.), permutation-equivalence appears naturally in machine learning problems, and benefiting from it calls for a specific approach. We present a strategy to effectively exploit permutation-equivalence, and provide a finite-time high-probability bound on the size of the refined confidence sets output by the strategy. Since refinement is not possible in general when there are too few observations, our finite-time analysis establishes, under mild technical assumptions, when the numbers of observations $(n_k)_{k\in \mathcal K}$ are large enough for the output confidence sets to improve over the initial individual sets. We carefully characterize this event and the corresponding improvement. Further, our result implies that the size of the confidence sets shrinks at asymptotic rates of $O(1/\sqrt{\sum_{k\in \mathcal K} n_k})$ and $O(1/\max_{k\in \mathcal K} n_{k})$ for elements inside and outside the support of $q$, respectively, when the size of each individual confidence set shrinks at respective rates of $O(1/\sqrt{n_k})$ and $O(1/n_k)$. We illustrate the practical benefit of exploiting permutation-equivalence on a reinforcement learning task.
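To give a feel for the in-support rates, the following sketch compares an individual interval width with the width obtained by pooling all samples. It rests on two assumptions that are not the paper's setting: the matching between distributions is taken as already known, and a plain Hoeffding interval is used as a stand-in for the surrogate intervals named in the TL;DR. The sample counts mirror those in the caption of Figure 3; `hoeffding_half_width` is a hypothetical helper name.

```python
import math


def hoeffding_half_width(n: int, delta: float = 0.05) -> float:
    """Half-width of a two-sided Hoeffding interval for a mean in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


counts = [1000, 250, 250]  # n_k for three equivalent distributions

# Individual intervals: each shrinks like 1 / sqrt(n_k).
individual = [hoeffding_half_width(n) for n in counts]

# Idealized pooled interval (matching assumed known): 1 / sqrt(sum_k n_k).
pooled = hoeffding_half_width(sum(counts))

for n, w in zip(counts, individual):
    print(f"n_k = {n:5d}: individual half-width {w:.4f}, gain x{w / pooled:.2f}")
print(f"pooled over {sum(counts)} samples: half-width {pooled:.4f}")
```

Under these assumptions the gain for distribution $k$ is $\sqrt{\sum_{k'} n_{k'} / n_k}$, so the distributions with the fewest observations benefit most from pooling.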

Paper Structure (36 sections, 6 theorems, 52 equations, 8 figures, 2 algorithms)

Introduction
Related work: Permutation-invariance and learning permutation.
Outline and contribution.
Setup and Notations: Tightening Estimation using Equivalence
Empirical estimates and confidence sets.
Refined estimates and confidence intervals.
Warming-up: $K=2$.
Building Confidence Sets Exploiting Permutation Equivalence
Identification of Compatible Matchings.
Refined Concentration Sets.
Case 1: $I_{k,x,k'}$ is a singleton.
Case 2: $I_{k,x,k'}$ is not a singleton.
Numerical illustration of Algorithm .
The Statistical Benefit of Permutation Equivalence
Impact of $L$.
...and 21 more sections

Key Result

Theorem 1

Under Assumption and the event $\Omega$, it holds for all $k$, for all $x\!\in\! \mathcal{X}_{p_k}$ (points in the support of $p_k$),

Figures (8)

Figure 1: Left: two equivalent distributions (red, blue dots) and their Upper and Lower confidence bounds. Right: Refined confidence bounds exploiting equivalence.
Figure 2: Non-empty intersections between confidence intervals, and the resulting pruning.
Figure 3: The first experiment. Left: Initial confidence sets, generated from $n_0=1000, n_1=250$, and $n_2=250$ observations. Right: Confidence sets output by Algorithm exploiting $\mathbb{G}_\mathcal{X}$-equivalence.
Figure 4: The second experiment. Left: Initial confidence sets, generated from $n_0=1000, n_1=250$, and $n_2=250$ observations. Right: Confidence sets output by Algorithm exploiting $\mathbb{G}_\mathcal{X}$-equivalence.
Figure 5: Ratio between initial and refined (empirical Bernstein) confidence sets on problem instances with $|\mathcal{X}|=10$, $K=5$, as a function of $L$ for $N_1=200$ (left), and as a function of $N_1$ for $L=5$ (right). All values are averaged over $100$ independent experiments.
...and 3 more figures

Theorems & Definitions (11)

Definition 1: Permutation-equivalent set
Remark 1
Definition 2: Surrogate Confidence Intervals
Theorem 1: Concentration benefit for $x\in \mathcal{X}_{p_k}$
Theorem 2: Concentration benefit for $x\notin \mathcal{X}_{p_k}$
Remark 2: Asymptotic behavior
Lemma 1
Lemma 2
Remark 3
Lemma 3
...and 1 more
