Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation

Minghe Shen, Ananth Balashankar, Adam Fisch, David Madras, Miguel Rodrigues

Abstract

The ability to rigorously estimate the failure rates of large language models (LLMs) is a prerequisite for their safe deployment. Currently, however, practitioners often face a tradeoff between expensive human gold standards and potentially severely biased automatic annotation schemes such as "LLM-as-a-Judge" labeling. In this paper, we propose a new, practical, and efficient approach to LLM failure rate estimation based on constrained maximum-likelihood estimation (MLE). Our method integrates three distinct signal sources: (i) a small, high-quality human-labeled calibration set, (ii) a large corpus of LLM-judge annotations, and, most importantly, (iii) additional side information via domain-specific constraints derived from known bounds on judge performance statistics. We validate our approach through a comprehensive empirical study, benchmarking it against state-of-the-art baselines like Prediction-Powered Inference (PPI). Across diverse experimental regimes -- spanning varying judge accuracies, calibration set sizes, and LLM failure rates -- our constrained MLE consistently delivers more accurate and lower-variance estimates than existing methods. By moving beyond the "black-box" use of automated judges to a flexible framework, we provide a principled, interpretable, and scalable pathway towards LLM failure-rate certification.
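
To make the estimator concrete, the sketch below shows one way the constrained MLE could be assembled in Python. The likelihood factorization (paired human/judge labels on the calibration set, a marginal term for the judge-only corpus) and the $\pm\delta$ box constraint around anchor TPR/FPR values are illustrative assumptions on our part; the names `constrained_mle`, `tpr0`, `fpr0`, and `delta` are hypothetical and not taken from the paper.

```python
# Hedged sketch of a constrained MLE for the LLM failure rate theta.
# Assumptions (not from the paper): the calibration set carries paired
# (human, judge) labels, and the judge's TPR/FPR are box-constrained
# within +/- delta of anchor values tpr0/fpr0.
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, y_m, yhat_m, yhat_j):
    """Joint negative log-likelihood of (theta, TPR, FPR).

    y_m    : human labels on the calibration set (1 = failure)
    yhat_m : judge labels on the same calibration examples
    yhat_j : judge labels on the large, human-unlabeled corpus
    """
    theta, tpr, fpr = (np.clip(p, 1e-9, 1 - 1e-9) for p in params)

    # Calibration set: p(y) * p(yhat | y).
    ll = np.sum(y_m * np.log(theta) + (1 - y_m) * np.log(1 - theta))
    ll += np.sum(y_m * (yhat_m * np.log(tpr) + (1 - yhat_m) * np.log(1 - tpr)))
    ll += np.sum((1 - y_m) * (yhat_m * np.log(fpr) + (1 - yhat_m) * np.log(1 - fpr)))

    # Judge-only corpus: marginal p(yhat = 1) = theta*TPR + (1 - theta)*FPR.
    q = theta * tpr + (1 - theta) * fpr
    ll += np.sum(yhat_j * np.log(q) + (1 - yhat_j) * np.log(1 - q))
    return -ll

def constrained_mle(y_m, yhat_m, yhat_j, tpr0, fpr0, delta):
    """Maximize the joint likelihood subject to a box on (TPR, FPR)."""
    bounds = [
        (1e-6, 1 - 1e-6),                                        # theta: free
        (max(tpr0 - delta, 1e-6), min(tpr0 + delta, 1 - 1e-6)),  # TPR box
        (max(fpr0 - delta, 1e-6), min(fpr0 + delta, 1 - 1e-6)),  # FPR box
    ]
    x0 = [np.clip(np.mean(y_m), 0.01, 0.99), tpr0, fpr0]
    res = minimize(neg_log_lik, x0, args=(y_m, yhat_m, yhat_j),
                   bounds=bounds, method="L-BFGS-B")
    return res.x  # (theta_hat, tpr_hat, fpr_hat)
```

Shrinking `delta` ties the judge parameters more tightly to the anchors; this is the knob swept in Figure 2 below.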

Figures (14)

  • Figure 1: Our approach for estimating the true failure rate ($\theta$) of a target LLM using noisy judge evaluations. The framework generates ground-truth labels $S_M$ via human experts and automated labels $S_J$ via an LLM judge. These labels form a small, high-quality set ($\mathcal{D}_M$) and a much larger but noisy set ($\mathcal{D}_J$), where $n_J \gg n_M$. The estimator is then applied to these combined sources to estimate the failure rate $\theta$ alongside the judge's performance parameters. In practice, the proposed approach can also ingest partial prior knowledge about the judge’s reliability parameters derived from in-domain calibration datasets collected under the same target LLM and judge.
  • Figure 2: MSE of different estimators on synthetic data. Panels sweep the constraint width $\delta$, the number of labeled samples $n_M$, the judge TPR, and the judge FPR, respectively, while holding all other parameters fixed. The parameter $\delta$ controls the width of the constraint region, with smaller values imposing tighter constraints on the judge parameters (see details in Appendix \ref{app:setup}); a toy simulation in this spirit is sketched after the figure list. Note that the curves for UMLE and PPI++ nearly coincide due to their similar performance.
  • Figure 3: Difference in MSE between the CMLE and PPI++ under misspecified judge parameters. The x- and y-axes represent deviations of the assumed TPR and FPR from their true values, with $n_M = 50$ and $n_J = 10{,}000$. Colors indicate the relative MSE difference, where lighter colors correspond to smaller differences and darker red or blue indicate larger deviations. The boxed region highlights anchor $(\mathrm{TPR}, \mathrm{FPR})$ values for which the true parameters remain contained within the CMLE constraint.
  • Figure 4: Mean, variance, and MSE of different estimators on the Jigsaw dataset, with $n_M = 50$ and fixed $n_J = 10{,}000$. Qwen2.5-0.5B-Instruct is used as a classifier, and LLaMA-3.1-8B-Instruct serves as the judge (TPR = 0.939, FPR = 0.053). Note that the curves for UMLE and PPI++ nearly coincide due to their similar performance.
  • Figure 5: Mean, variance, and MSE of different estimators on the Jigsaw dataset, with $n_M = 50$ and fixed $n_J = 10{,}000$. Qwen2.5-0.5B-Instruct is used as the classifier, LLaMA-3.1-8B-Instruct serves as the judge (TPR = 0.939, FPR = 0.053), and CMLE constraints are centered at TPR/FPR estimates (TPR = 0.948, FPR = 0.063) obtained from the Hate Speech Offensive dataset using the same LLMs. Note that the curves for UMLE and PPI++ nearly coincide due to their similar performance.
  • ...and 9 more figures
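
The synthetic comparison of Figure 2 can be mimicked with a short simulation. All numbers below (true failure rate, judge TPR/FPR, sample sizes) are illustrative stand-ins rather than the paper's exact configuration; the naive baseline is simply the raw judge positive rate, and the snippet reuses the hypothetical `constrained_mle` helper sketched after the abstract.

```python
# Toy simulation in the spirit of Figure 2 (illustrative parameters only).
# Requires numpy/scipy and the constrained_mle sketch defined earlier.
import numpy as np

rng = np.random.default_rng(0)
theta_true, tpr_true, fpr_true, delta = 0.10, 0.90, 0.10, 0.05
n_m, n_j, n_trials = 50, 10_000, 200

errs_cmle, errs_naive = [], []
for _ in range(n_trials):
    # Small human-labeled calibration set with paired judge labels.
    y_m = rng.binomial(1, theta_true, n_m)
    yhat_m = np.where(y_m == 1,
                      rng.binomial(1, tpr_true, n_m),
                      rng.binomial(1, fpr_true, n_m))
    # Large judge-only corpus.
    y_j = rng.binomial(1, theta_true, n_j)
    yhat_j = np.where(y_j == 1,
                      rng.binomial(1, tpr_true, n_j),
                      rng.binomial(1, fpr_true, n_j))

    theta_hat, _, _ = constrained_mle(y_m, yhat_m, yhat_j,
                                      tpr0=tpr_true, fpr0=fpr_true, delta=delta)
    errs_cmle.append((theta_hat - theta_true) ** 2)
    errs_naive.append((yhat_j.mean() - theta_true) ** 2)

print(f"MSE, constrained MLE: {np.mean(errs_cmle):.2e}")
print(f"MSE, raw judge rate : {np.mean(errs_naive):.2e}")
```

Because the raw judge rate converges to $\theta\,\mathrm{TPR} + (1-\theta)\,\mathrm{FPR}$ rather than to $\theta$, its MSE stays bounded away from zero no matter how large $n_J$ grows; removing exactly this bias is what the constrained MLE targets.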