Likelihood-free hypothesis testing

Patrik Róbert Gerber, Yury Polyanskiy

TL;DR

This work analyzes likelihood-free hypothesis testing (LFHT), where the hypotheses are accessible only through simulators, and derives a minimax trade-off between the number of simulations $n$ and the number of real samples $m$, governed by the nonparametric complexity $n_{\mathsf{GoF}}(\epsilon)$. It shows that LFHT can achieve constant-error testing without fully estimating $\mathbb P_0,\mathbb P_1$, as long as $m\gg 1/\epsilon^2$, $n$ meets the corresponding goodness-of-fit (GoF) and two-sample (TS) benchmarks, and the product $mn$ scales as $n_{\mathsf{GoF}}^2(\epsilon)$. The central tool is Ingster’s $L^2$-distance test, adapted to the LFHT setting via a unified projection-based statistic $T_{\mathsf{LF}}$, together with reductions to GoF and two-sample testing that characterize the full region of feasibility across regular distribution classes ($\mathcal{P}_{\mathsf{H}}, \mathcal{P}_{\mathsf{G}}, \mathcal{P}_{\mathsf{Db}}, \mathcal{P}_{\mathsf{D}}$), plus robustness and Hellinger extensions. The results show that LFHT interpolates between goodness-of-fit testing, two-sample testing, and density estimation, and they expose a phase transition for discrete distributions, providing a blueprint for testing without full distribution learning in high-complexity, simulator-based settings.
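As a worked illustration of the product constraint, the two calculations below use only the relation $mn \asymp n_{\mathsf{GoF}}^2(\epsilon)$ and the condition $m \gg 1/\epsilon^2$ stated above, with constants suppressed. The symbol $n^\star(m)$, the number of simulations implied by a given $m$, is introduced here for illustration and is not notation from the paper.

$$
n^\star(m) \asymp \frac{n_{\mathsf{GoF}}^2(\epsilon)}{m},
\qquad
n^\star\!\left(\tfrac{1}{\epsilon^2}\right) \asymp \epsilon^2\, n_{\mathsf{GoF}}^2(\epsilon),
\qquad
m = n \;\Longrightarrow\; n \asymp n_{\mathsf{GoF}}(\epsilon).
$$

Read this way, the fewest real samples ($m \asymp 1/\epsilon^2$) demand the most simulations, while the balanced choice $m = n$ needs only on the order of $n_{\mathsf{GoF}}(\epsilon)$ of each.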

Abstract

Consider the problem of binary hypothesis testing. Given $Z$ coming from either $\mathbb P^{\otimes m}$ or $\mathbb Q^{\otimes m}$, to decide between the two with small probability of error it is sufficient, and in many cases necessary, to have $m\asymp 1/\epsilon^2$, where $\epsilon$ measures the separation between $\mathbb P$ and $\mathbb Q$ in total variation ($\mathsf{TV}$). Achieving this, however, requires complete knowledge of the distributions and can be done, for example, using the Neyman-Pearson test. In this paper we consider a variation of the problem which we call likelihood-free hypothesis testing, where access to $\mathbb P$ and $\mathbb Q$ is given through $n$ i.i.d. observations from each. In the case when $\mathbb P$ and $\mathbb Q$ are assumed to belong to a non-parametric family, we demonstrate the existence of a fundamental trade-off between $n$ and $m$ given by $nm\asymp n_{\mathsf{GoF}}^2(\epsilon)$, where $n_{\mathsf{GoF}}(\epsilon)$ is the minimax sample complexity of testing between the hypotheses $H_0:\, \mathbb P=\mathbb Q$ vs $H_1:\, \mathsf{TV}(\mathbb P,\mathbb Q)\geq\epsilon$. We show this for three families of distributions, in addition to the family of all discrete distributions, for which we obtain a more complicated trade-off exhibiting an additional phase transition. Our results demonstrate the possibility of testing without fully estimating $\mathbb P$ and $\mathbb Q$, provided $m \gg 1/\epsilon^2$.
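The setup in the abstract can be sketched for discrete distributions on $\{0,\dots,k-1\}$. The code below is a simplified plug-in $L^2$-distance comparison, not the paper's statistic: it decides for whichever simulated sample has its empirical histogram closer to that of $Z$. The function names (`lfht_statistic`, `lfht_test`, `demo`), the alternating perturbation used to build $\mathbb Q$, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def lfht_statistic(x, y, z, k):
    """Plug-in L2 comparison (illustrative, not the paper's exact statistic).

    x, y: simulated samples from P and Q (integer labels in 0..k-1).
    z:    real samples from either P or Q.
    Negative values favor "Z ~ P", positive values favor "Z ~ Q".
    """
    p_hat = np.bincount(x, minlength=k) / len(x)
    q_hat = np.bincount(y, minlength=k) / len(y)
    z_hat = np.bincount(z, minlength=k) / len(z)
    return np.sum((p_hat - z_hat) ** 2) - np.sum((q_hat - z_hat) ** 2)


def lfht_test(x, y, z, k):
    """Return 0 to decide "Z ~ P" and 1 to decide "Z ~ Q"."""
    return 0 if lfht_statistic(x, y, z, k) <= 0 else 1


def demo(seed=0, k=50, eps=0.3, n=5000, m=1000):
    """Run the test once with Z ~ P and once with Z ~ Q (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    p = np.full(k, 1.0 / k)                 # P: uniform on k symbols
    signs = np.where(np.arange(k) % 2 == 0, 1.0, -1.0)
    q = (1.0 + 2.0 * eps * signs) / k       # Q: TV(P, Q) = eps when k is even
    x = rng.choice(k, size=n, p=p)          # n simulations from P
    y = rng.choice(k, size=n, p=q)          # n simulations from Q
    z_from_p = rng.choice(k, size=m, p=p)   # m real samples, truth is P
    z_from_q = rng.choice(k, size=m, p=q)   # m real samples, truth is Q
    return lfht_test(x, y, z_from_p, k), lfht_test(x, y, z_from_q, k)
```

Calling `demo()` returns a pair of decisions, one for each ground truth. Rerunning it with smaller `n` or `m` gives an informal feel for how the two sample sizes trade off, though this plug-in version carries no optimality guarantee.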

Paper Structure

This paper contains 48 sections, 33 theorems, 208 equations, 1 figure, and 2 tables.

Key Result

Lemma 1

For all $\epsilon$ and $\mathcal{P}$ with $|\mathcal{P}|\geq2$, the relation holds, where the implied constant is universal.

Figures (1)

  • Figure 1: Light and dark gray show $\mathcal{R}_{\mathsf{LF}}$ and its complement, respectively, on $\log$ scale; the striped region depicts $\mathcal{R}_{\mathsf{TS}} \subsetneq \mathcal{R}_{\mathsf{LF}}$. The left plot is valid for $\mathcal{P} \in \{\mathcal{P}_{\mathsf{H}}, \mathcal{P}_{\mathsf{G}}, \mathcal{P}_{\mathsf{Db}}\}$ for all settings of $\epsilon, k$. For $\mathcal{P}_{\mathsf{D}}$ the left plot applies when $k \lesssim \epsilon^{-4}$ and the right plot otherwise.

Theorems & Definitions (68)

  • Remark 1
  • Remark 2
  • Definition 1
  • Lemma 1
  • proof
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Remark 3
  • ...and 58 more