Bayesian Prediction-Powered Inference

R. Alex Hofer, Joshua Maynez, Bhuwan Dhingra, Adam Fisch, Amir Globerson, William W. Cohen

TL;DR

This paper proposes a framework for PPI based on Bayesian inference that lets researchers easily develop new task-appropriate PPI methods. It also introduces improved PPI methods for several important cases, such as autoraters that give discrete responses and autoraters whose scores have a non-linear relationship to human scores.

Abstract

Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. Specifically, PPI methods provide tighter confidence intervals by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate, but potentially biased, automatic system. We propose a framework for PPI based on Bayesian inference that allows researchers to develop new task-appropriate PPI methods easily. Exploiting the ease with which we can design new metrics, we propose improved PPI methods for several important cases, such as autoraters that give discrete responses (e.g., prompted LLM "judges") and autoraters with scores that have a non-linear relationship to human scores.
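To make the idea concrete, here is a minimal sketch of one way a Bayesian PPI estimate could be computed for a binary autorater. It is an illustration under stated assumptions, not the paper's implementation: it assumes the decomposition $P(H=1) = P(A=1)\,P(H=1 \mid A=1) + P(A=0)\,P(H=1 \mid A=0)$, places a uniform Beta(1, 1) prior on each proportion, and approximates the posterior of the combined quantity by Monte Carlo sampling. The function name `chain_rule_interval`, its parameters, and the example counts are all hypothetical.

```python
import random


def chain_rule_interval(n_auto_pos, n_auto, labeled,
                        num_samples=20000, level=0.95, seed=0):
    """Posterior mean and credible interval for P(H=1).

    n_auto_pos / n_auto: autorater positives and total on the large
        autorater-only sample.
    labeled: list of (a, h) pairs with a, h in {0, 1} from the small
        human-labeled sample.
    Assumes uniform Beta(1, 1) priors (an illustrative choice).
    """
    rng = random.Random(seed)
    # Count human labels within each autorater response.
    h1_a1 = sum(1 for a, h in labeled if a == 1 and h == 1)
    h0_a1 = sum(1 for a, h in labeled if a == 1 and h == 0)
    h1_a0 = sum(1 for a, h in labeled if a == 0 and h == 1)
    h0_a0 = sum(1 for a, h in labeled if a == 0 and h == 0)

    draws = []
    for _ in range(num_samples):
        # Sample each unknown proportion from its Beta posterior.
        p_a1 = rng.betavariate(1 + n_auto_pos, 1 + n_auto - n_auto_pos)
        p_h1_given_a1 = rng.betavariate(1 + h1_a1, 1 + h0_a1)
        p_h1_given_a0 = rng.betavariate(1 + h1_a0, 1 + h0_a0)
        # Combine the sampled proportions into a draw of P(H=1).
        draws.append(p_a1 * p_h1_given_a1 + (1 - p_a1) * p_h1_given_a0)

    draws.sort()
    tail = (1 - level) / 2
    lower = draws[int(tail * num_samples)]
    upper = draws[int((1 - tail) * num_samples) - 1]
    mean = sum(draws) / num_samples
    return mean, lower, upper


# Hypothetical counts: 100 human-labeled pairs, 5000 autorater labels.
labeled = ([(1, 1)] * 54 + [(1, 0)] * 6 +
           [(0, 1)] * 8 + [(0, 0)] * 32)
mean, lower, upper = chain_rule_interval(3000, 5000, labeled)
print(f"P(H=1) estimate {mean:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

The large autorater sample pins down $P(A=1)$ almost exactly, so the remaining uncertainty comes mostly from the two conditional proportions estimated on the human-labeled pairs. How much this narrows the interval relative to a classical estimate depends on how informative the autorater is about the human label.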

Paper Structure (51 sections, 31 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Top: Estimating accuracy $P(H=1)$ with 100 human-labeled examples (green) or 5000 autorater-labeled examples (red). Dotted vertical lines are the true accuracies $P(H=1)$ and $P(A=1)$ (for this synthetic data). Bottom: The dot-dashed blue/red lines are a 95% confidence interval computed with classical methods from 100 human-labeled examples. The grey histogram and solid blue/red lines are a 95% confidence interval using PPI, which combines the autorater and human predictions (see text).
  • Figure 2: Monte Carlo integration to compute confidence intervals for a function $g(\theta_1,\ldots,\theta_k)$, where $\theta_i$'s are unknown population means and proportions that must be estimated from a sample $D$.
  • Figure 3: Comparing stratified estimates on multiple datasets. On all graphs, the $x$ axis is the number of human-labeled examples $n$, and the $y$ axis is the confidence interval width. All PPI methods improve over classical approaches. The chain rule estimate and difference estimate are generally comparable, and the stratified estimate improves performance substantially over each of them (see Section \ref{sec:impact-part}).
  • Figure 4: Confidence intervals for the chain rule estimate with abstentions, the difference estimate, and the classical method for the open-book QA methods from Kamalloo et al. (2023).
  • Figure 5: Fraction of pairs of truly different systems that can be distinguished by a paired test, with classical methods and a chain rule estimate.
  • ...and 1 more figure