Bayesian Quantification with Black-Box Estimators

Albert Ziegler, Paweł Czyż

TL;DR

This paper addresses quantifying class prevalence in an unlabeled dataset under prior probability shift by casting the problem as Bayesian inference. It introduces a tractable Bayesian model that replaces a high-dimensional $P(X|Y)$ with a low-dimensional surrogate $P(C|Y)$ via a mapping $f$, and derives a discrete model with parameters $(\pi, \pi', \varphi)$ that admit efficient inference through sufficient statistics and Hamiltonian Monte Carlo. The authors prove asymptotic consistency of the MAP estimator under weak conditions and demonstrate through extensive experiments that the Bayesian approach matches or exceeds the performance of established methods (BBSE, IR, CC) while providing principled uncertainty quantification, especially when the number of classes differs between labeled and unlabeled data. The method is practical for calibration tasks and decision-making under uncertainty, with applications in healthcare and biomedical data, and it highlights the value of uncertainty-aware, prior-informed quantification in real-world shift scenarios.

Abstract

Understanding how different classes are distributed in an unlabeled data set is an important challenge for the calibration of probabilistic classifiers and uncertainty quantification. Approaches like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary (and potentially biased) black-box classifier trained on a different (shifted) data set to estimate the class distribution and yield asymptotic guarantees under weak assumptions. We demonstrate that all these algorithms are closely related to the inference in a particular Bayesian model, approximating the assumed ground-truth generative process. Then, we discuss an efficient Markov Chain Monte Carlo sampling scheme for the introduced model and show an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios, and show it is competitive with, and in some cases superior to, the state of the art.
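
The point estimators named in the abstract share one idea: under prior probability shift, the distribution of black-box outputs on the unlabeled data is a mixture of the per-class output distributions, $P_\text{unl}(C=k) = \sum_l P(C=k \mid Y=l)\, \pi'_l$, so $\pi'$ can be recovered by solving a linear system. The sketch below illustrates that idea only; it is not the paper's Bayesian model and not a faithful reimplementation of any one published estimator. The function name `matrix_inversion_estimate` and the count arrays are hypothetical.

```python
import numpy as np


def matrix_inversion_estimate(labeled_counts, unlabeled_counts):
    """Point estimate of unlabeled class prevalence (illustrative sketch).

    labeled_counts[l, k]: labeled points with true class l and black-box output k.
    unlabeled_counts[k]:  unlabeled points with black-box output k.
    """
    labeled_counts = np.asarray(labeled_counts, dtype=float)
    unlabeled_counts = np.asarray(unlabeled_counts, dtype=float)

    # Row-normalize to estimate P(C = k | Y = l) from the labeled data.
    phi_hat = labeled_counts / labeled_counts.sum(axis=1, keepdims=True)

    # Empirical distribution of black-box outputs on the unlabeled data.
    q_hat = unlabeled_counts / unlabeled_counts.sum()

    # Solve phi_hat^T pi' = q_hat in the least-squares sense.
    pi_prime, *_ = np.linalg.lstsq(phi_hat.T, q_hat, rcond=None)
    return pi_prime


# Hypothetical counts: a classifier that is right 80% / 90% of the time,
# applied to unlabeled data whose true prevalence is (0.3, 0.7).
labeled = [[80, 20], [10, 90]]
unlabeled = [310, 690]
estimate = matrix_inversion_estimate(labeled, unlabeled)
```

A plain linear solve like this returns a single vector with no uncertainty attached, and with noisy counts it can leave the probability simplex. A posterior over $\pi'$ avoids both limitations.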

Paper Structure

This paper contains 21 sections, 1 theorem, 7 equations, 4 figures, 1 table.

Key Result

Theorem 2.1

Assume the model is not misspecified, the true $\pi^*$, $\pi'^*$, and all $\varphi_{l:}^*$ parameters lie inside the open simplices, the prior $P(\pi, \pi', \varphi)$ is continuous and strictly positive on the whole space, and the ground-truth $P(C\mid Y) = (\varphi^*)^T$ matrix is of full rank $L$
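
The full-rank assumption can be read as an identifiability condition: if the rows of $\varphi$ are linearly independent, then two different prevalence vectors cannot produce the same distribution of black-box outputs. The minimal sketch below checks this numerically on two hypothetical matrices; `has_full_row_rank`, `phi_ok`, and `phi_bad` are illustrative names, not taken from the paper.

```python
import numpy as np


def has_full_row_rank(phi):
    """True if the L x K matrix phi (rows: P(C | Y = l)) has rank L."""
    phi = np.asarray(phi, dtype=float)
    return np.linalg.matrix_rank(phi) == phi.shape[0]


# Informative classifier: the rows differ, so the rank is 2.
phi_ok = np.array([[0.8, 0.2],
                   [0.1, 0.9]])

# Uninformative classifier: identical rows, so the rank is 1.
phi_bad = np.array([[0.5, 0.5],
                    [0.5, 0.5]])

# Under phi_bad, very different prevalences give the same output distribution,
# so no amount of unlabeled data can tell them apart.
q_a = phi_bad.T @ np.array([0.3, 0.7])
q_b = phi_bad.T @ np.array([0.9, 0.1])
```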

Figures (4)

  • Figure 1: Left: High-dimensional model $\mathcal{M}_\text{true}$. Right: tractable approximation $\mathcal{M}_\text{approx}$. Filled nodes represent observed r.v., top row represents the labeled data set and the bottom row represents the unlabeled data set.
  • Figure 2: Quantification using simulated categorical black-box classifiers under different scenarios.
  • Figure 3: Bayesian posterior and point estimates in three scenarios.
  • Figure 4: Gaussian mixture experiment. Left: densities of $P_\text{lab}(X)$ (blue) and $P_\text{unl}(X)$ (yellow) together with lines marking $a_1$ and $a_{K-1}$. Middle: posterior density on $\pi'_1$ in different models. Dashed vertical line marks the exact $P_\text{unl}(Y=1)$. Right: dashed horizontal line marks the exact $P_\text{unl}(Y=1)$. The blue region marks the mean and the 95% credible interval in the Gaussian mixture model. Yellow markers mark the means and the 95% credible intervals for the discretized models.

Theorems & Definitions (1)

  • Theorem 2.1