Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration

Charafeddine Mouzouni

TL;DR

How far can a practitioner trust a black-box AI system's output on a given task? The answer is a reliability level: a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees.

Abstract

Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors, which are made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.
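As a rough illustration of how the two ingredients named above might fit together, the Python sketch below ranks sampled answers by frequency and then applies the standard split-conformal quantile to a calibration batch. The choice of score (the rank of the first acceptable answer), the function names, and the toy data are assumptions made for illustration; they are not taken from the paper's definitions.

```python
import math
from collections import Counter


def ranked_answers(samples):
    """Group identical answers and rank them by frequency, most frequent first."""
    counts = Counter(samples)
    return [answer for answer, _ in counts.most_common()]


def first_acceptable_rank(ranked, is_acceptable):
    """1-based rank of the first acceptable answer, or infinity if there is none."""
    for rank, answer in enumerate(ranked, start=1):
        if is_acceptable(answer):
            return rank
    return math.inf


def conformal_set_size(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration items to certify this alpha with a finite set.
        return math.inf
    return sorted(calibration_scores)[k - 1]


# Toy calibration batch (hypothetical data): K sampled answers plus a gold answer.
calibration = [
    (["42", "42", "41", "42", "40"], "42"),
    (["7", "8", "7", "7", "7"], "7"),
    (["3", "5", "5", "3", "5"], "3"),
    (["9", "9", "9", "9", "9"], "9"),
    (["1", "2", "2", "2", "1"], "2"),
    (["6", "4", "4", "5", "4"], "5"),
    (["0", "0", "0", "0", "0"], "0"),
]

scores = [
    first_acceptable_rank(ranked_answers(samples), lambda a, gold=gold: a == gold)
    for samples, gold in calibration
]
set_size = conformal_set_size(scores, alpha=0.25)
print("calibration scores:", scores)
print("prediction set size at alpha = 0.25:", set_size)
```

At test time, a prediction set under this sketch would consist of the top `set_size` ranked answers for a new query. A score of infinity marks a calibration item for which no sampled answer was acceptable.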

Paper Structure (97 sections, 19 theorems, 52 equations, 12 figures, 11 tables)

Introduction
Self-consistency decoding.
Conformal prediction and language models.
Method comparison.
LLM-as-judge and evaluation bias.
Uncertainty quantification for LLMs.
Calibration of probabilistic predictions.
Problem Setting
Queries, answers, and acceptability
Per-query acceptability rate and agent quality
The evaluation goal
Bias and Variance in LLM Evaluation
What does reliable evaluation require?
Error anatomy of current evaluation methods
Single-sample evaluation
...and 82 more sections

Key Result

Proposition 3.2

The single-sample evaluator is unbiased, and its variance is $p^\star(x)\,(1 - p^\star(x))$. The variance is maximized at $p^\star(x) = 1/2$ (the hardest queries), where it equals $1/4$.
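A short worked derivation may help here. It assumes the single-sample evaluator is the acceptability indicator of one sampled answer, written $Z$ below (the symbol $Z$ is introduced only for this illustration), so that $Z$ is a Bernoulli variable with success probability $p^\star(x)$:

```latex
\begin{align*}
  \mathbb{E}[Z] &= p^\star(x)
    && \text{(no bias)} \\
  \operatorname{Var}(Z) &= \mathbb{E}[Z^2] - \mathbb{E}[Z]^2
    = p^\star(x) - p^\star(x)^2
    = p^\star(x)\bigl(1 - p^\star(x)\bigr) \\
  \frac{d}{dp}\, p(1 - p) &= 1 - 2p = 0
    \iff p = \tfrac{1}{2},
    \qquad \operatorname{Var}(Z)\big|_{p^\star(x) = 1/2} = \tfrac{1}{4}
\end{align*}
```

Under this assumption the estimate is correct on average but noisiest exactly where the query is hardest, which is consistent with the title of Remark 3.3.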

Figures (12)

Figure 1: Pipeline overview. Step 1: ask the AI system the same question $K$ times and collect its answers. Step 2: group identical answers and rank them by frequency. Step 3: a human checks a small calibration batch; the framework outputs a single reliability level (e.g. $94.6\%$) with a formal coverage guarantee. No model internals are needed---only API access.
Figure 2: Coverage validation: empirical coverage vs. target $1-\alpha$ for $\alpha \in \{0.01, \ldots, 0.30\}$ across all five benchmarks. Points above the diagonal are consistent with the marginal coverage guarantee (Theorem \ref{thm:coverage}). Points below the diagonal (HumanEval, BigBench, MMLU) arise when the unsolvable fraction $\beta$ exceeds $\alpha$: Theorem \ref{thm:bias_transparency}(3) predicts $M^\star = +\infty$ in this regime, meaning no finite prediction set can cover unsolvable items. Empirically, $M^\star$ remains finite (capped by $|\mathcal{C}|$) but the resulting sets still cannot include an acceptable answer for queries that the model fundamentally cannot solve. The under-coverage is thus a diagnosed capability gap, not a calibration failure---conditional coverage on solvable items exceeds $0.96$ across all five benchmarks.
Figure 3: Variance reduction: mode error vs. $K$ across all five benchmarks. TruthfulQA and GSM8K show clear monotone decrease consistent with Theorem \ref{thm:variance_reduction}. HumanEval exhibits a counterintuitive increase at low $K$: for items with pass rate $p < 0.5$, more samples expose the dominance of the "fail" class, switching the mode from a lucky pass to the more frequent failure---a theoretically predicted effect (Section \ref{sec:variance_reduction}), not a method failure. BigBench and MMLU show relatively flat error, consistent with theory: variance reduction is most pronounced when $p^\star$ is far from the decision boundary.
Figure 4: Mean prediction set size vs. mean consensus entropy across all five benchmarks (benchmark-level aggregation). Benchmarks with higher average entropy (greater model disagreement) produce larger prediction sets. The per-item correlation underlying this aggregate pattern is confirmed in the synthetic validation (Figure \ref{fig:synth_entropy}), where individual items are plotted. Error bars show 95% bootstrap CIs.
Figure 5: Canonicalization effect on average prediction set size. BigBench shows a $39.2\%$ reduction (the largest effect), MMLU shows $21.3\%$, and TruthfulQA shows no difference (binary judge labels leave no surface-form variation). GSM8K and HumanEval are excluded (deterministic canonicalization).
...and 7 more figures
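The behaviour described in the Figure 3 caption, where mode error falls with $K$ for easy items but rises for items with pass rate $p < 0.5$, can be reproduced with a small exact calculation. The sketch below assumes a binary pass/fail outcome, independent draws, and odd $K$ (to avoid ties); the function name is hypothetical, and "error" here means that the majority class is "fail".

```python
from math import comb


def majority_fail_probability(p, k):
    """Probability that at most half of k independent draws pass (odd k only)."""
    if k % 2 == 0:
        raise ValueError("use an odd k so the majority is never tied")
    # The mode is "fail" when the number of passes j is at most (k - 1) / 2.
    return sum(
        comb(k, j) * p**j * (1 - p) ** (k - j)
        for j in range(k // 2 + 1)
    )


for p in (0.8, 0.4):
    errors = [majority_fail_probability(p, k) for k in (1, 3, 5, 11, 21)]
    print(f"p = {p}:", [round(e, 4) for e in errors])
```

For $p = 0.8$ the printed probabilities shrink as $K$ grows, while for $p = 0.4$ they grow: more samples make the mode settle on the class that is actually more frequent, which for $p < 0.5$ is the failing one.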

Theorems & Definitions (60)

Definition 2.1: Acceptability
Definition 2.3: Per-query acceptability rate
Definition 2.4: Reliability level
Remark 2.5: Interpreting the reliability level
Definition 3.1: Evaluation method
Proposition 3.2: Bias--variance of single-sample evaluation
proof
Remark 3.3: Unbiased but unreliable
Proposition 3.4: Bias--variance of LLM-as-judge evaluation
proof
...and 50 more
