PIE: Performance Interval Estimation for Free-Form Generation Tasks

Chi-Yang Hsu, Alexander Braylan, Yiheng Su, Matthew Lease, Omar Alonso

TL;DR

This paper introduces Performance Interval Estimation (PIE), a task-agnostic framework to predict continuous quality metrics for free-form generation and to provide calibrated uncertainty intervals around those predictions. It compares confidence-based regression (CE-Reg) and reference-free LLM judging (RF-LLMaaJ) across 11 diverse datasets, two LLM backbones, and multiple evaluation metrics, finding that CE-Reg with graph-based consistency features delivers more accurate point estimates and better-calibrated intervals, often with as few as 16 labeled examples. The authors release PIE-data and the accompanying code to support reproducible benchmarking and further research. They argue that calibrated prediction intervals are essential for risk-aware decision making in real-world tasks such as software generation and automated reasoning. The results highlight the importance of uncertainty quantification for complex, multi-dimensional generation tasks and suggest practical directions for deploying generation systems with confidence estimates and abstention/routing strategies.

Abstract

Confidence estimation infers a probability for whether each model output is correct or not. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Interval Estimation (PIE) to predict both: 1) point estimates for any arbitrary set of continuous-valued evaluation metrics; and 2) calibrated uncertainty intervals around these point estimates. We then compare two approaches: LLM-as-judge vs. classic regression with confidence estimation features. Evaluation over 11 datasets spans summarization, translation, code generation, function-calling, and question answering. Regression is seen to achieve both: i) lower error point estimates of metric scores; and ii) well-calibrated uncertainty intervals. To support reproduction and follow-on work, we share our data and code.
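The two outputs PIE asks for, a point estimate and an uncertainty interval, can be illustrated with a minimal sketch. This is not the authors' CE-Reg implementation: it assumes a single hypothetical confidence feature per output, fits an ordinary least-squares line to a small labeled set, and builds a Gaussian 95% interval from the residual spread (ignoring uncertainty in the fitted coefficients). The names `fit_line` and `predict_interval` and the toy data are illustrative only.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of metric score (ys) on one confidence feature (xs).

    Returns slope, intercept, and the residual standard deviation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # n - 2 degrees of freedom: two parameters were estimated.
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, sigma

def predict_interval(x, slope, intercept, sigma, z=1.96):
    """Point estimate plus a simplified Gaussian interval (z=1.96 for ~95%)."""
    mu = intercept + slope * x
    return mu, mu - z * sigma, mu + z * sigma

# Hypothetical toy data: 16 labeled outputs with a confidence feature in [0, 1]
# and a continuous quality score that rises with confidence.
features = [i / 15 for i in range(16)]
scores = [0.2 + 0.6 * x + 0.05 * (-1) ** i for i, x in enumerate(features)]

slope, intercept, sigma = fit_line(features, scores)
estimate, lower, upper = predict_interval(0.5, slope, intercept, sigma)
print(f"estimate={estimate:.3f}, 95% interval=[{lower:.3f}, {upper:.3f}]")
```

A Bayesian regressor would additionally widen the interval where training data is sparse; the constant-width interval here is the simplest case that still yields both outputs.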

Paper Structure

This paper contains 20 sections, 5 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Illustration of different regression model predictions (solid lines) and confidence intervals (color shading around the lines). In this example, a Bayesian Ridge regression model performs best: its predictions are neither over-confident (as with XGBoost) nor under-confident (as with the Gaussian Process).
  • Figure 2: CE-Reg outperforms RF-LLMaaJ on RMSE, CRPS, correlation, and ACE (95%) on Llama 3.2 11B (left) and Gemini 2.0 Flash-Lite (right), averaged across all datasets and task metrics. Lower is better for all metrics except correlation. The results differ little between the two LLMs.
  • Figure 3: Impact of training data size on CE-Reg performance, measured by test-set CRPS across datasets using the selected task-specific metric (see Table \ref{tab:selected_targets}). Training data sizes are capped at 64 instances. CRPS stabilizes after approximately 16 instances, suggesting a plateau for certain datasets.
  • Figure 4: Comparison across regression models on development sets. For CRPS, RMSE, ACE(95%) lower is better; for Pearson correlation higher is better.
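
The metrics named in these captions can be sketched as follows, under stated assumptions. The CRPS function uses the standard closed form for a Gaussian predictive distribution; whether the paper computes CRPS this way is not stated in this excerpt. The excerpt also does not define ACE, so `coverage_error` takes it to mean the absolute gap between empirical interval coverage and the nominal level (0.95 here), which is an assumption. All function names are illustrative.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at observation y (lower is better)."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm_cdf(z) - 1) + 2 * norm_pdf(z) - 1 / math.sqrt(math.pi))

def rmse(ys, mus):
    """Root mean squared error of the point estimates."""
    return math.sqrt(sum((y - m) ** 2 for y, m in zip(ys, mus)) / len(ys))

def coverage_error(ys, lowers, uppers, nominal=0.95):
    """Assumed reading of ACE: |empirical coverage - nominal coverage|."""
    covered = sum(lo <= y <= hi for y, lo, hi in zip(ys, lowers, uppers))
    return abs(covered / len(ys) - nominal)

# Hypothetical observed scores, predicted means, and a shared predictive std.
observed = [0.30, 0.55, 0.70, 0.90]
predicted = [0.35, 0.50, 0.75, 0.80]
sigma = 0.1
lowers = [m - 1.96 * sigma for m in predicted]
uppers = [m + 1.96 * sigma for m in predicted]

mean_crps = sum(crps_gaussian(y, m, sigma) for y, m in zip(observed, predicted)) / len(observed)
print(f"RMSE={rmse(observed, predicted):.4f}")
print(f"mean CRPS={mean_crps:.4f}")
print(f"coverage error (95%)={coverage_error(observed, lowers, uppers):.4f}")
```

CRPS rewards forecasts that are both accurate and appropriately sharp, which is why it penalizes over-confident and under-confident intervals alike, while the coverage gap checks calibration at a single nominal level.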