Post-Selection Distributional Model Evaluation

Amirmohammad Farzaneh; Osvaldo Simeone

Abstract

Formal model evaluation methods typically certify that a model satisfies a prescribed target level for a key performance indicator (KPI). In many applications, however, the relevant target KPI level is not known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-off between performance and reliability that each model can achieve at test time. This task requires reliable estimates of the test-time KPI distributions. It is complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, which can introduce post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls the post-selection false coverage rate (FCR) of the distributional KPI estimates and is proved to be more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance–reliability trade-offs.

Paper Structure (14 sections, 4 theorems, 18 equations, 4 figures)

Introduction
Problem Setup
Post-Selection Distributional Model Evaluation
Sample-Splitting Distributional Model Evaluation
In-Sample Distributional Model Evaluation
Comparing SS-DME and PS-DME
Related Work
Experiments
Benchmarks and Performance Measures
Distributional Model Evaluation Strategies
Performance Measures
Synthetic Data
Evaluating and Selecting LLM Decoding Configurations
Conclusion

Key Result

Lemma 1

For any pre-selection rule $S(\cdot)$, the confidence bands $\mathcal{C}_k = [L_k(x), U_k(x)]$ produced by SS-DME, with $L_k(x)$ and $U_k(x)$ defined in (eq:split_lower) and (eq:split_higher), respectively, satisfy $\mathrm{FCR} \le \delta$.
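
The sample-splitting idea behind Lemma 1 can be illustrated with a minimal sketch. The band equations (eq:split_lower) and (eq:split_higher) are not reproduced on this page, so the sketch substitutes a standard Dvoretzky-Kiefer-Wolfowitz (DKW) band; this is an assumption, not necessarily the construction used by SS-DME. The names `dkw_band`, `ss_dme_sketch`, and `select` are hypothetical. The samples are assumed i.i.d., so that the selection split is independent of the evaluation split. Under that assumption, each selected band keeps its $1-\delta$ coverage regardless of the selection rule, which gives $\mathrm{FCR} \le \delta$.

```python
import numpy as np

def dkw_band(samples, grid, delta):
    """Two-sided DKW band for a CDF on `grid` (assumed stand-in for the paper's band)."""
    samples = np.sort(np.asarray(samples, dtype=float))
    n = samples.size
    # Empirical CDF on the grid: fraction of samples <= x.
    ecdf = np.searchsorted(samples, grid, side="right") / n
    # DKW half-width: the band covers the true CDF uniformly w.p. >= 1 - delta.
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return np.clip(ecdf - eps, 0.0, 1.0), np.clip(ecdf + eps, 0.0, 1.0)

def ss_dme_sketch(kpi, n_sel, select, grid, delta):
    """kpi: (K, n) array of KPI samples; `select` maps the (K, n_sel) split to indices."""
    kpi = np.asarray(kpi, dtype=float)
    sel_data, eval_data = kpi[:, :n_sel], kpi[:, n_sel:]
    chosen = select(sel_data)  # pre-selection sees the selection split only
    # Bands are built from the held-out evaluation split only.
    return {int(k): dkw_band(eval_data[k], grid, delta) for k in chosen}

# Hypothetical usage: 3 configurations, keep the 2 with the lowest mean loss.
rng = np.random.default_rng(0)
kpi = rng.exponential(scale=[[1.0], [2.0], [3.0]], size=(3, 200))
select = lambda d: np.argsort(d.mean(axis=1))[:2]
bands = ss_dme_sketch(kpi, n_sel=40, select=select,
                      grid=np.linspace(0.0, 10.0, 101), delta=0.1)
```

The price of this simplicity is that only the $n - n^\text{sel}$ evaluation samples contribute to the band width, which is the inefficiency that in-sample evaluation aims to avoid.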

Figures (4)

Figure 1: (a) CDF of a negatively oriented key performance indicator (KPI), such as the prefill latency for an LLM agent to be deployed on a device [huang2026scale], for two model configurations. For a test-time reliability level $1-\gamma$, knowing the CDF of each configuration allows the user to compare the performance achievable by different configurations and to explore different reliability levels $1-\gamma$. (b) Two strategies for handling model pre-selection, namely data splitting and in-sample inference. (c)-(d) Illustration of the proposed post-selection distributional model evaluation (PS-DME), which applies in-sample pre-selection: (c) A user-defined selection strategy identifies a subset $\mathcal{K}$ of promising configurations among the $K$ candidates; (d) For all the configurations in set $\mathcal{K}$, PS-DME constructs confidence bands for the CDFs of the KPI of interest with guaranteed false coverage rate (FCR), enabling reliable comparison of performance–reliability trade-offs across configurations.
Figure 2: For a candidate hyperparameter $\lambda_k = 0.0231$, we show the true CDF $F_k(x)$, the empirical CDF $\widehat{F}_k(x)$, and the post-selection confidence band $[L_k(x), U_k(x)]$ for the data splitting, naive in-sample, and post-hoc in-sample benchmarks.
Figure 3: Best guaranteed KPI for the synthetic data experiment as a function of the target probability $1-\gamma$ for SS-DME and PS-DME. For SS-DME, a fraction $n^\text{sel}/n\in\{0.1, 0.2, 0.3\}$ of the available data is used for model pre-selection, while the remaining samples are used to construct the CDF confidence bands. (a) Results with 20 calibration samples. (b) Results with 100 calibration samples. The vertical error bars show the standard error of the mean across 100 random splits of the dataset.
Figure 4: Best guaranteed KPI (loss) for the Spider text-to-SQL experiment as a function of the target probability $1-\gamma$ for SS-DME and PS-DME. For SS-DME, a fraction $n^\text{sel}/n\in\{0.1, 0.2, 0.3\}$ of the available data is used for decoding strategy pre-selection, while the remaining samples are used to construct the CDF confidence bands. Results are shown for $n=50$ calibration samples. The vertical error bars show the standard error of the mean across 100 random splits of the dataset.
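
The "best guaranteed KPI" plotted in Figures 3 and 4 can be read off a lower CDF band, as in the sketch below. The page does not give the exact definition, so the sketch assumes the following. For a negatively oriented KPI, a configuration's guaranteed KPI at reliability $1-\gamma$ is the smallest grid point $x$ with $L_k(x) \ge 1-\gamma$, and the best guaranteed KPI is the minimum over the selected configurations. The function names and the toy band values are hypothetical.

```python
import numpy as np

def guaranteed_kpi(grid, lower_band, gamma):
    """Smallest grid value x whose lower CDF band satisfies L(x) >= 1 - gamma."""
    grid = np.asarray(grid, dtype=float)
    ok = np.asarray(lower_band, dtype=float) >= 1.0 - gamma
    # If the band never reaches 1 - gamma on the grid, no finite guarantee exists.
    return float(grid[ok].min()) if ok.any() else float("inf")

def best_guaranteed_kpi(grid, lower_bands, gamma):
    """Return (configuration, value) with the lowest guaranteed KPI (lower is better)."""
    values = {k: guaranteed_kpi(grid, lb, gamma) for k, lb in lower_bands.items()}
    best = min(values, key=values.get)
    return best, values[best]

# Toy lower bands for two hypothetical configurations on a shared grid.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
lower_bands = {
    "A": [0.0, 0.50, 0.80, 0.95, 1.00],
    "B": [0.0, 0.20, 0.91, 0.92, 0.93],
}
print(best_guaranteed_kpi(grid, lower_bands, gamma=0.10))  # ('B', 2.0)
print(best_guaranteed_kpi(grid, lower_bands, gamma=0.02))  # ('A', 4.0)
```

In this toy example the preferred configuration changes with the reliability level. That is the kind of comparison that a band valid uniformly in $x$ supports without re-running the evaluation for each $\gamma$.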

Theorems & Definitions (5)

Lemma 1
Definition 1: e-calibrator
Theorem 1
Lemma 2
Corollary 2
