Efficient MAP Estimation of LLM Judgment Performance with Prior Transfer

Huaizhi Qu, Inyoung Choi, Zhen Tan, Song Wang, Sukwon Yun, Qi Long, Faizan Siddiqui, Kwonjoon Lee, Tianlong Chen

TL;DR

This work addresses the challenge of efficiently estimating LLM ensemble judgment distributions with limited annotations. It introduces BetaConform, a MAP-based framework that models judgment counts with a mixture of Beta-Binomial distributions, uses conformal prediction for adaptive stopping, and leverages text-based prior transfer to boost accuracy with few labels. The approach achieves theoretically guaranteed, data-efficient distribution estimation and demonstrates substantial empirical gains, e.g., small error margins with as few as 10 samples on diverse benchmarks. By reducing annotation effort and providing robust statistical guarantees, BetaConform offers a practical pathway for scalable evaluation of LLM judges in real-world settings.

Abstract

LLM ensembles are widely used as LLM judges. However, how to estimate their accuracy, especially in an efficient way, remains an open question. In this paper, we present a principled maximum a posteriori (MAP) framework for economical and precise estimation of the performance of LLM ensemble judgment. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, replacing the vanilla Binomial distribution. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling to balance accuracy with efficiency. Furthermore, we design a prior transfer mechanism that uses distributions learned on open-source datasets to improve estimation on a target dataset when only scarce annotations are available. Finally, we present BetaConform, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed distribution estimation of LLM ensemble judgment with minimal labeled samples. BetaConform is also validated empirically. For instance, with only 10 samples from the TruthfulQA dataset, BetaConform estimates the performance of a Llama ensemble judge with an error margin as small as 3.37%.
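To make the modeling choice concrete, the following is a minimal sketch that contrasts a vanilla Binomial model of the number of correct judgments with a two-component mixture of Beta-Binomial distributions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of two components, the ensemble size of 11, and all parameter values are hypothetical.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_pmf(s: int, k: int, p: float) -> float:
    """P(S = s) when each of k judges is independently correct with probability p."""
    return math.comb(k, s) * p**s * (1.0 - p) ** (k - s)


def beta_binomial_pmf(s: int, k: int, alpha: float, beta: float) -> float:
    """P(S = s) when the per-question accuracy is itself drawn from Beta(alpha, beta)."""
    log_ratio = log_beta(s + alpha, k - s + beta) - log_beta(alpha, beta)
    return math.comb(k, s) * math.exp(log_ratio)


def mixture_pmf(s: int, k: int, w: float, comp1: tuple, comp2: tuple) -> float:
    """Two-component Beta-Binomial mixture with weight w on the first component."""
    return w * beta_binomial_pmf(s, k, *comp1) + (1.0 - w) * beta_binomial_pmf(s, k, *comp2)


def majority_vote_error(pmf_values: list) -> float:
    """Probability that at most floor(k/2) of k judges are correct (k assumed odd)."""
    k = len(pmf_values) - 1
    return sum(pmf_values[: k // 2 + 1])


# Hypothetical parameters, chosen only for illustration.
K = 11
binomial = [binomial_pmf(s, K, 0.7) for s in range(K + 1)]
mixture = [mixture_pmf(s, K, 0.7, (8.0, 2.0), (2.0, 8.0)) for s in range(K + 1)]

print("Binomial majority-vote error:", majority_vote_error(binomial))
print("Mixture majority-vote error: ", majority_vote_error(mixture))
```

In this sketch, one mixture component concentrates mass on questions where most judges are correct and the other on questions where most judges are wrong, which a single Binomial with a shared accuracy cannot represent.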

Paper Structure

This paper contains 35 sections, 3 theorems, 43 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Corollary 4.2

The error rate of the mixture of Beta-Binomial distributions is expressed in terms of $\mathrm{B}(\cdot, \cdot)$, the Beta function.
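
As a hedged sketch only, under assumed notation that is not taken from the paper ($k$ judges with $k$ odd, $S$ the number of correct judgments, and a two-component mixture with weight $w$ and Beta parameters $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$), the majority-voting error rate of such a mixture would take the form

$$
P(\text{error}) = \sum_{s=0}^{\lfloor k/2 \rfloor} \binom{k}{s} \left[ w \, \frac{\mathrm{B}(s + \alpha_1,\, k - s + \beta_1)}{\mathrm{B}(\alpha_1, \beta_1)} + (1 - w) \, \frac{\mathrm{B}(s + \alpha_2,\, k - s + \beta_2)}{\mathrm{B}(\alpha_2, \beta_2)} \right].
$$

Each bracketed term, multiplied by $\binom{k}{s}$, is the mixture probability that exactly $s$ judges are correct, and the sum runs over the outcomes in which the correct votes fall short of a majority. The paper's own statement may use different symbols or a different number of components.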

Figures (7)

  • Figure 1: In this paper, we aim to answer (1) how to estimate the judgment distribution of an LLM ensemble on a dataset, and (2) how to make this estimation efficient to reduce annotation effort.
  • Figure 2: Overview of BetaConform. Given a target dataset, adaptive stopping is adopted to determine the number of samples (panel b, adaptive stopping section). During iterative sampling, the sampling deviation is monitored using conformal prediction. The sampling process stops when the deviation is sufficiently low. Next, the estimate from the small number of samples drawn in the previous step is further enhanced by transferring distribution priors from source datasets (panel c, prior transfer section). The transfer mechanism assigns a larger weight to source datasets that are textually closer to the target dataset.
  • Figure 3: Comparison of judgment distributions among actual, Binomial, and ours. Llama-3.3-70B and GPT-4 ensembles of $11$ models are tested on HaluEval and JudgeBench, respectively. The Binomial distribution is estimated using the single-judge accuracy $p$. Our mixture distribution is estimated with $100$ samples and scaled to the full dataset. Our distribution is consistently closer to the actual one.
  • Figure 4: Majority voting error rate of actual, Binomial, and our mixture distribution. Binomial uses the single-judge accuracy $p$. Our distribution is estimated with $100$ random samples and tested $3$ times. The line denotes the average error rate and the shaded region represents the standard variance. Binomial shows a decreasing error rate, while our distribution captures the actual trend.
  • Figure 5: Examples of distribution prior transfer. Splits from HaluEval form distinct clusters in the embedding space, and transfer does not degrade performance compared to only using target dataset samples. In contrast, topics in TruthfulQA exhibit closer proximity, where transfer leads to significant performance improvements compared to solely using the limited samples of the target dataset.
  • ...and 2 more figures
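
The captions for Figures 2 and 5 describe a transfer step that weights source datasets by their textual closeness to the target. One way this could be sketched is shown below. It assumes that each dataset is summarized by a single embedding vector, that weights come from a softmax over cosine similarities, and that source priors are Beta parameter pairs combined by a weighted average. All of these choices, along with the function names and the `temperature` parameter, are hypothetical and not confirmed by the source.

```python
import math


def cosine_similarity(u: list, v: list) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_u * norm_v)


def transfer_weights(target_emb: list, source_embs: list, temperature: float = 0.1) -> list:
    """Softmax over cosine similarities: closer source datasets get larger weights."""
    sims = [cosine_similarity(target_emb, emb) for emb in source_embs]
    max_sim = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((sim - max_sim) / temperature) for sim in sims]
    total = sum(exps)
    return [e / total for e in exps]


def transferred_prior(weights: list, source_params: list) -> tuple:
    """Weighted average of (alpha, beta) pairs learned on the source datasets."""
    alpha = sum(w * a for w, (a, _) in zip(weights, source_params))
    beta = sum(w * b for w, (_, b) in zip(weights, source_params))
    return alpha, beta


# Hypothetical two-dimensional embeddings and source priors, for illustration only.
target = [1.0, 0.0]
sources = [[1.0, 0.1], [0.0, 1.0]]
weights = transfer_weights(target, sources)
prior = transferred_prior(weights, [(8.0, 2.0), (3.0, 3.0)])
print("weights:", weights)
print("transferred prior (alpha, beta):", prior)
```

The resulting pair could then serve as the prior in a MAP estimate on the few labeled target samples. A lower `temperature` concentrates the weight on the single closest source dataset.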

Theorems & Definitions (5)

  • Definition 3.1: LLM Ensemble Judgment
  • Definition 3.2: LLM Ensemble Correct Judgment
  • Corollary 4.2: Mixture Distribution Error Rate
  • Proposition 5.1: Sample Amount with Adaptive Stopping
  • Proposition 5.2: Error Rate with Adaptive Stopping