PIE: Performance Interval Estimation for Free-Form Generation Tasks
Chi-Yang Hsu, Alexander Braylan, Yiheng Su, Matthew Lease, Omar Alonso
TL;DR
This paper introduces Performance Interval Estimation (PIE), a task-agnostic framework to predict continuous quality metrics for free-form generation and to provide calibrated uncertainty intervals around those predictions. It compares confidence-based regression (CE-Reg) and reference-free LLM judging (RF-LLMaaJ) across 11 diverse datasets, two LLM backbones, and multiple evaluation metrics, finding that CE-Reg with graph-based consistency features delivers more accurate point estimates and better-calibrated intervals, often with as few as 16 labeled examples. The authors release PIE-data and the accompanying code to support reproducible benchmarking and further research. They argue that calibrated prediction intervals are essential for risk-aware decision making in real-world tasks such as software generation and automated reasoning. The results highlight the importance of uncertainty quantification for complex, multi-dimensional generation tasks and suggest practical directions for deploying generation systems with confidence estimates and abstention/routing strategies.
Abstract
Confidence estimation infers a probability that each model output is correct. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Interval Estimation (PIE) to predict both: 1) point estimates for an arbitrary set of continuous-valued evaluation metrics; and 2) calibrated uncertainty intervals around these point estimates. We then compare two approaches: LLM-as-judge vs. classic regression with confidence estimation features. Evaluation over 11 datasets spans summarization, translation, code generation, function-calling, and question answering. Regression achieves both: i) lower-error point estimates of metric scores; and ii) well-calibrated uncertainty intervals. To support reproduction and follow-on work, we share our data and code.
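To make the two PIE outputs concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes a single hypothetical confidence feature per output and a metric score in [0, 1], fits a one-variable least-squares regression for the point estimate, and uses split conformal prediction (a technique swapped in here for illustration; the abstract does not say how the intervals are built) for the interval. All function names and the data generator are assumptions.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def conformal_halfwidth(residuals, alpha):
    """Split-conformal half-width: the ceil((n+1)(1-alpha))-th smallest
    absolute residual on held-out calibration data (infinite if n is too small)."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(residuals)[k - 1]


def predict_interval(x, slope, intercept, q):
    """Point estimate plus an interval clipped to the metric range [0, 1]."""
    p = slope * x + intercept
    return p, max(0.0, p - q), min(1.0, p + q)


def coverage(intervals, ys):
    """Fraction of true scores that fall inside their predicted interval."""
    hits = sum(lo <= y <= hi for (_, lo, hi), y in zip(intervals, ys))
    return hits / len(ys)


def make_data(rng, n):
    """Synthetic (confidence feature, metric score) pairs; purely illustrative."""
    xs = [rng.random() for _ in range(n)]
    ys = [min(1.0, max(0.0, 0.2 + 0.6 * x + rng.gauss(0.0, 0.1))) for x in xs]
    return xs, ys


def demo(seed=0, alpha=0.1):
    rng = random.Random(seed)
    train_x, train_y = make_data(rng, 16)   # small labeled set for the regressor
    cal_x, cal_y = make_data(rng, 200)      # held-out set for interval calibration
    test_x, test_y = make_data(rng, 500)

    slope, intercept = fit_line(train_x, train_y)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(cal_x, cal_y)]
    q = conformal_halfwidth(residuals, alpha)

    intervals = [predict_interval(x, slope, intercept, q) for x in test_x]
    return coverage(intervals, test_y)


if __name__ == "__main__":
    print(f"empirical coverage at nominal 90%: {demo():.3f}")
```

An interval estimator is well calibrated in this sense when the empirical coverage on held-out outputs is close to the nominal level 1 - alpha; intervals that are wider than necessary can reach the same coverage, so interval width is worth reporting alongside it.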
