Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration

Charafeddine Mouzouni

TL;DR

How far can a practitioner trust a black-box AI system's output on a given task? The paper answers with a reliability level: a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees.

Abstract

Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors -- made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.

Paper Structure (97 sections, 19 theorems, 52 equations, 12 figures, 11 tables)

Key Result

Proposition 3.2

For the single-sample evaluator, the variance is maximized at $p^\star(x) = 1/2$ (the hardest queries), where it equals $1/4$.
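
The value $1/4$ can be recovered with a short calculation. The sketch below assumes the single-sample evaluator is the indicator that one sampled answer is acceptable, so that it is Bernoulli with success probability $p^\star(x)$. The symbol $Z$ is introduced here for illustration only and is not the paper's notation.

```latex
% Assumption: Z is the indicator that a single sampled answer to x is acceptable,
% so Z ~ Bernoulli(p^\star(x)). Z is illustrative notation, not the paper's.
\begin{align*}
\mathbb{E}[Z] &= p^\star(x), \\
\operatorname{Var}(Z) &= \mathbb{E}[Z^2] - \mathbb{E}[Z]^2
  = p^\star(x) - p^\star(x)^2
  = p^\star(x)\bigl(1 - p^\star(x)\bigr), \\
\frac{d}{dp}\, p(1-p) &= 1 - 2p = 0 \iff p = \tfrac{1}{2},
\qquad \operatorname{Var}(Z)\big|_{p^\star(x) = 1/2} = \tfrac{1}{4}.
\end{align*}
```

Under this assumption the estimate is unbiased for $p^\star(x)$, yet on a query with $p^\star(x) = 1/2$ its standard deviation is $1/2$, as large as a $\{0,1\}$-valued quantity allows.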

Figures (12)

  • Figure 1: Pipeline overview. Step 1: ask the AI system the same question $K$ times and collect its answers. Step 2: group identical answers and rank them by frequency. Step 3: a human checks a small calibration batch; the framework outputs a single reliability level (e.g. $94.6\%$) with a formal coverage guarantee. No model internals are needed---only API access.
  • Figure 2: Coverage validation: empirical coverage vs. target $1-\alpha$ for $\alpha \in \{0.01, \ldots, 0.30\}$ across all five benchmarks. Points above the diagonal are consistent with the marginal coverage guarantee (Theorem \ref{thm:coverage}). Points below the diagonal (HumanEval, BigBench, MMLU) arise when the unsolvable fraction $\beta$ exceeds $\alpha$: Theorem \ref{thm:bias_transparency}(3) predicts $M^\star = +\infty$ in this regime, meaning no finite prediction set can cover unsolvable items. Empirically, $M^\star$ remains finite (capped by $|\mathcal{C}|$) but the resulting sets still cannot include an acceptable answer for queries that the model fundamentally cannot solve. The under-coverage is thus a diagnosed capability gap, not a calibration failure---conditional coverage on solvable items exceeds $0.96$ across all five benchmarks.
  • Figure 3: Variance reduction: mode error vs. $K$ across all five benchmarks. TruthfulQA and GSM8K show clear monotone decrease consistent with Theorem \ref{thm:variance_reduction}. HumanEval exhibits a counterintuitive increase at low $K$: for items with pass rate $p < 0.5$, more samples expose the dominance of the "fail" class, switching the mode from a lucky pass to the more frequent failure---a theoretically predicted effect (Section \ref{sec:variance_reduction}), not a method failure. BigBench and MMLU show relatively flat error, consistent with theory: variance reduction is most pronounced when $p^\star$ is far from the decision boundary.
  • Figure 4: Mean prediction set size vs. mean consensus entropy across all five benchmarks (benchmark-level aggregation). Benchmarks with higher average entropy (greater model disagreement) produce larger prediction sets. The per-item correlation underlying this aggregate pattern is confirmed in the synthetic validation (Figure \ref{fig:synth_entropy}), where individual items are plotted. Error bars show 95% bootstrap CIs.
  • Figure 5: Canonicalization effect on average prediction set size. BigBench shows a $39.2\%$ reduction (the largest effect), MMLU shows $21.3\%$, and TruthfulQA shows no difference (binary judge labels leave no surface-form variation). GSM8K and HumanEval are excluded (deterministic canonicalization).
  • ...and 7 more figures
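
The three steps in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the nonconformity score is taken to be the rank of the first acceptable answer in the frequency ranking, and the threshold is the standard split-conformal quantile. All function names are hypothetical, and the sketch stops at the calibrated set size; it does not compute the reliability level of Definition 2.4.

```python
import math
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def ranked_answers(samples: Sequence[Hashable]) -> List[Hashable]:
    """Step 2: group identical answers and rank them by frequency, most frequent first."""
    return [answer for answer, _count in Counter(samples).most_common()]


def rank_of_first_acceptable(
    ranking: Sequence[Hashable],
    is_acceptable: Callable[[Hashable], bool],
) -> float:
    """Assumed score: 1-based rank of the first acceptable answer, inf if there is none."""
    for position, answer in enumerate(ranking, start=1):
        if is_acceptable(answer):
            return position
    return math.inf


def conformal_threshold(scores: Sequence[float], alpha: float) -> float:
    """Step 3: split-conformal quantile, the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration items for this alpha: no finite threshold exists.
        return math.inf
    return sorted(scores)[k - 1]


def prediction_set(samples: Sequence[Hashable], threshold: float) -> List[Hashable]:
    """Return the top-ranked answers up to the calibrated rank threshold."""
    ranking = ranked_answers(samples)
    if math.isinf(threshold):
        return ranking
    return ranking[: int(threshold)]


# Toy calibration batch: (K sampled answers, human-verified acceptable answer).
calibration = [
    (["4", "4", "5", "4"], "4"),
    (["7", "8", "8", "7", "8"], "7"),
    (["a", "a", "a"], "a"),
    (["x", "y", "x"], "x"),
]
scores = [
    rank_of_first_acceptable(ranked_answers(samples), lambda a, gold=gold: a == gold)
    for samples, gold in calibration
]
threshold = conformal_threshold(scores, alpha=0.25)
answers = prediction_set(["b", "a", "b", "c"], threshold)
```

With a rank-based score, harder questions show up as larger answer sets: a query whose acceptable answer is rarely the most frequent one pushes the threshold up. Unsolvable items receive an infinite score, which is the regime that the Figure 2 caption describes as $M^\star = +\infty$.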

Theorems & Definitions (60)

  • Definition 2.1: Acceptability
  • Definition 2.3: Per-query acceptability rate
  • Definition 2.4: Reliability level
  • Remark 2.5: Interpreting the reliability level
  • Definition 3.1: Evaluation method
  • Proposition 3.2: Bias--variance of single-sample evaluation
  • proof
  • Remark 3.3: Unbiased but unreliable
  • Proposition 3.4: Bias--variance of LLM-as-judge evaluation
  • proof
  • ...and 50 more