PIE: Performance Interval Estimation for Free-Form Generation Tasks

Chi-Yang Hsu; Alexander Braylan; Yiheng Su; Matthew Lease; Omar Alonso

TL;DR

This paper introduces Performance Interval Estimation (PIE), a task-agnostic framework to predict continuous quality metrics for free-form generation and to provide calibrated uncertainty intervals around those predictions. It compares confidence-based regression (CE-Reg) and reference-free LLM judging (RF-LLMaaJ) across 11 diverse datasets, two LLM backbones, and multiple evaluation metrics, finding that CE-Reg with graph-based consistency features delivers more accurate point estimates and better-calibrated intervals, often with as few as 16 labeled examples. The authors release PIE-data and the accompanying code to support reproducible benchmarking and further research. They argue that calibrated prediction intervals are essential for risk-aware decision making in real-world tasks such as software generation and automated reasoning. The results highlight the importance of uncertainty quantification for complex, multi-dimensional generation tasks and suggest practical directions for deploying generation systems with confidence estimates and abstention/routing strategies.

Abstract

Confidence estimation infers the probability that each model output is correct. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Interval Estimation (PIE) to predict both: 1) point estimates for an arbitrary set of continuous-valued evaluation metrics; and 2) calibrated uncertainty intervals around these point estimates. We then compare two approaches: LLM-as-judge vs. classic regression with confidence estimation features. Evaluation over 11 datasets spans summarization, translation, code generation, function-calling, and question answering. Regression achieves both: i) lower-error point estimates of metric scores; and ii) well-calibrated uncertainty intervals. To support reproduction and follow-on work, we share our data and code.
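To make the two PIE outputs concrete, the following is a minimal sketch of a point estimate with an interval around it. It is not the authors' CE-Reg implementation: the features and scores are synthetic, the regressor is ordinary least squares, and the intervals come from split conformal prediction, a standard technique swapped in here purely for illustration. All names (`fit_linear`, `conformal_halfwidth`, the three "confidence features") are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Point estimates of the metric score for each row of X."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def conformal_halfwidth(residuals, alpha):
    """Split-conformal half-width: the ceil((n+1)(1-alpha))-th smallest |residual|."""
    n = len(residuals)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return float(np.sort(np.abs(residuals))[k - 1])


# Synthetic stand-ins (assumption): 3 confidence features per output and a
# continuous quality score in [0, 1], e.g. a normalized evaluation metric.
rng = np.random.default_rng(0)
n = 1500
X = rng.uniform(0.0, 1.0, size=(n, 3))
noise = rng.normal(0.0, 0.1, size=n)
y = np.clip(0.2 + 0.6 * X[:, 0] + 0.1 * X[:, 1] + noise, 0.0, 1.0)

# Disjoint splits: fit the regressor, calibrate the interval width, then test.
X_train, y_train = X[:500], y[:500]
X_cal, y_cal = X[500:1000], y[500:1000]
X_test, y_test = X[1000:], y[1000:]

coef = fit_linear(X_train, y_train)
alpha = 0.1  # target 90% coverage
q = conformal_halfwidth(y_cal - predict(coef, X_cal), alpha)

point = predict(coef, X_test)          # 1) point estimates
lower, upper = point - q, point + q    # 2) uncertainty intervals

# Empirical coverage: fraction of true scores that fall inside their interval.
coverage = float(np.mean((y_test >= lower) & (y_test <= upper)))
print(f"half-width={q:.3f}  empirical coverage={coverage:.3f}")
```

An interval method is well calibrated in this sense when empirical coverage on held-out data stays close to the target level (here 90%); intervals that are much wider than necessary can reach the target trivially, so width should be reported alongside coverage.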
