Likelihood-Free Frequentist Inference: Bridging Classical Statistics and Machine Learning for Reliable Simulator-Based Inference

Niccolò Dalmasso, Luca Masserano, David Zhao, Rafael Izbicki, Ann B. Lee

TL;DR

This work proposes a modular inference framework that bridges classical statistics and modern machine learning to provide a practical approach for constructing confidence sets with near finite-sample validity at any value of the unknown parameters, and interpretable diagnostics for estimating empirical coverage across the entire parameter space.

Abstract

Many areas of science rely on simulators that implicitly encode intractable likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, especially outside asymptotic and low-dimensional regimes. At the same time, popular LFI methods - such as Approximate Bayesian Computation or more recent machine learning techniques - do not necessarily lead to valid scientific inference because they do not guarantee confidence sets with nominal coverage in general settings. In addition, LFI currently lacks practical diagnostic tools to check the actual coverage of computed confidence sets across the entire parameter space. In this work, we propose a modular inference framework that bridges classical statistics and modern machine learning to provide (i) a practical approach for constructing confidence sets with near finite-sample validity at any value of the unknown parameters, and (ii) interpretable diagnostics for estimating empirical coverage across the entire parameter space. We refer to this framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic can leverage LF2I to create valid confidence sets and diagnostics without costly Monte Carlo or bootstrap samples at fixed parameter settings. We study two likelihood-based test statistics (ACORE and BFF) and demonstrate their performance on high-dimensional complex data. Code is available at https://github.com/lee-group-cmu/lf2i.

Paper Structure

This paper contains 43 sections, 16 theorems, 82 equations, 8 figures, 7 algorithms.

Key Result

Proposition 1

Assume that, for every $\theta \in \Theta$, $G$ dominates $\nu$. If $\widehat{\mathbb{P}}(Y=1|\theta,\mathbf{x})=\mathbb{P}(Y=1|\theta,\mathbf{x})$ for every $\theta$ and $\mathbf{x}$, then $\widehat{\tau}(\mathcal{D}; \Theta_0)$ is the Bayes factor $\text{BF}(\mathcal{D}; \Theta_0)$.
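
The cancellation behind this result can be sketched as follows. The definitions used here are assumptions, since this summary does not restate them: the label $Y=1$ marks pairs $(\theta, \mathbf{x})$ with $\mathbf{x}$ drawn from the simulator at $\theta$ (density $f(\mathbf{x}|\theta)$), $Y=0$ marks pairs with $\mathbf{x}$ drawn from the reference distribution $G$ (density $g$), $\mathbb{P}(Y=1)=p$, and $\tau$ is the ratio of integrated odds over $\Theta_0$ and its complement under hypothetical priors $\pi_0$ and $\pi_1$. Under those assumptions, with exact class probabilities:

$$
\mathbb{O}(\mathbf{x};\theta) = \frac{\mathbb{P}(Y=1|\theta,\mathbf{x})}{\mathbb{P}(Y=0|\theta,\mathbf{x})} = \frac{p \, f(\mathbf{x}|\theta)}{(1-p) \, g(\mathbf{x})}
$$

$$
\tau(\mathcal{D};\Theta_0) = \frac{\int_{\Theta_0} \prod_{i=1}^{n} \mathbb{O}(\mathbf{x}_i;\theta) \, d\pi_0(\theta)}{\int_{\Theta_0^c} \prod_{i=1}^{n} \mathbb{O}(\mathbf{x}_i;\theta) \, d\pi_1(\theta)} = \frac{\int_{\Theta_0} \prod_{i=1}^{n} f(\mathbf{x}_i|\theta) \, d\pi_0(\theta)}{\int_{\Theta_0^c} \prod_{i=1}^{n} f(\mathbf{x}_i|\theta) \, d\pi_1(\theta)} = \text{BF}(\mathcal{D};\Theta_0)
$$

The factor $\left(\frac{p}{1-p}\right)^{n} \prod_{i=1}^{n} g(\mathbf{x}_i)^{-1}$ does not depend on $\theta$, so it moves outside both integrals and cancels. The domination assumption is what keeps the ratio $f/g$ well defined.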

Figures (8)

  • Figure 1: The three-branch fully modular framework for likelihood-free frequentist inference (LF2I). Center branch: Draw a sample ${\mathcal{T}}$ of size $B$ from the simulator to estimate an arbitrary test statistic $\lambda({\mathcal{D}};\theta)$. Here we show how to do so by estimating the likelihood via the odds function $\mathbb{O}({\mathbf{X}};\theta)$. Left branch: Draw a second sample ${\mathcal{T}}'$ of size $B'$ to estimate the critical values $C_{\theta}$ or p-values $p({\mathcal{D}}; \theta)$ for all $\theta \in \Theta$. Left $+$ Center: Once data $D$ are observed, we can construct confidence sets $\widehat{R}(D)$ with finite-$n$ validity according to Equation \ref{eq:est_conf_set}. Right branch: The LF2I diagnostics branch independently checks whether the coverage ${\mathbb P}_{{\mathcal{D}} | \theta} (\theta \in \widehat{R}({\mathcal{D}}))$ of the confidence set is indeed correct across the entire parameter space.
  • Figure 2: Neyman construction of confidence sets by inverting hypothesis tests. Left: For each $\theta_0 \in \Theta$, we find the critical value $C_{\theta_0}$ that rejects the null hypothesis $H_{0,\theta_0}$ at level $\alpha$; that is, $C_{\theta_0}$ is the $\alpha$-quantile of the distribution of the test statistic under the null (a likelihood ratio $\text{LR}(\mathcal{D}; \theta_0)$ in this case). Right: The horizontal solid lines represent acceptance regions for each $\theta_0 \in \Theta$. Suppose we observe data $D$. The confidence set for $\theta$ (red vertical solid line) consists of all $\theta_0$-values for which the observed test statistic $\text{LR}(D; \theta_0)$ (black curve) falls in the acceptance region.
  • Figure 3: GMM with unknown null distribution. Each panel shows the estimated coverage across the parameter space of 90% confidence sets for $\theta$. Rows represent experiments with different observed sample sizes: $n=10, 100, 1000$ (top, center, bottom). Columns represent three different approaches. Left: "LR with Monte Carlo samples" achieves nominal coverage everywhere but is computationally expensive, especially in higher dimensions. Center: "Chi-square LRT" clearly under-covers, i.e., confidence sets are not valid even for large $n$, except at $\theta=0$, where the mixture collapses to one Gaussian. Right: "LR with $C_{\theta_0}$ via quantile regression" returns finite-sample confidence sets with the nominal coverage of $90\%$ for all values of $\theta$ while using a total of 1000 simulations, rather than an MC sample of 1000 simulations at each grid point.
  • Figure 4: Poisson counting experiment with nuisance parameters. The diagnostics branch provides guidance as to which LFI approach to use for the problem at hand by pinpointing regions of the parameter space $\Theta$ where inference is unreliable. The panels show empirical coverage as a function of both $\mu$, the parameter of interest, and $\nu$, the nuisance parameter. Nominal coverage is $90\%$. Left: h-ACORE, which uses profiled likelihoods, is overly conservative in terms of actual coverage ($\approx 96\%$) across $\Theta$. Center: h-BFF, which marginalizes over $\nu$, under-covers in several regions (red crosses). Right: ACORE $\chi_1^2$, which uses cutoffs from the chi-square distribution, has almost no constraining power, yielding empirical coverage close to $100\%$ everywhere.
  • Figure 5: Constraining power. Relative size of the confidence sets constructed in Section \ref{sec:hep_example}. ACORE $\chi_1^2$ and h-ACORE yield the widest intervals (they are indeed overly conservative according to Figure \ref{fig:onoff_coverage}). h-BFF provides tighter confidence sets, but their size cannot be trusted when the method under-covers. LF2I diagnostics can identify the parameter regions where the approach is not valid (red crosses in Figure \ref{fig:onoff_coverage}). The dark-orange histogram reports h-BFF results after removing those points.
  • ...and 3 more figures

Theorems & Definitions (16)

  • Proposition 1: Fisher consistency
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Corollary 2
  • Theorem 7
  • ...and 6 more