Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators

Zihan Zhu, Yanqiu Wu, Qiongkai Xu

TL;DR

The paper tackles the challenge of evaluating sample-efficient model performance estimators under labeling budgets, where traditional metrics like RMSE and two-sided p-values can be misleading due to bias-variance interactions. It introduces a fault-tolerant evaluation framework (FT-Eval) that bounds both bias and variance within an application-specific tolerance $\varepsilon$, implemented via two one-sided tests (TOST). An automatic method to select the discrimination margin $\delta$ and adaptively set $\varepsilon$ across budgets is proposed, ensuring reliable comparisons across variance regimes. Empirical results across diverse datasets and models show that FT-Eval resolves conflicting signals from traditional metrics and provides actionable insights into estimator behavior, enabling more robust evaluation in API-based AI services and large-scale agentic systems.
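The TOST idea mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `tost_p_value` and its arguments are hypothetical, and a normal approximation stands in for whatever test statistic the authors use. It reports the larger of the two one-sided p-values, so a small value supports the claim that the mean estimate lies within $\varepsilon$ of the ground truth $\theta^*$.

```python
import math
from statistics import NormalDist, mean, stdev

def tost_p_value(estimates, theta_star, eps):
    """Return max(p_lower, p_upper) for the claim |mean - theta_star| < eps.

    Hypothetical helper: uses a normal (z) approximation, which is an
    assumption made for this sketch.
    """
    n = len(estimates)
    diff = mean(estimates) - theta_star
    se = stdev(estimates) / math.sqrt(n)  # standard error of the mean
    if se == 0.0:
        # Degenerate case: every run returned the same estimate.
        return 0.0 if abs(diff) < eps else 1.0
    z = NormalDist()
    # Lower test, H0: diff <= -eps. Evidence against H0 when (diff + eps)/se is large.
    p_lower = 1.0 - z.cdf((diff + eps) / se)
    # Upper test, H0: diff >= +eps. Evidence against H0 when (diff - eps)/se is very negative.
    p_upper = z.cdf((diff - eps) / se)
    return max(p_lower, p_upper)

# Illustrative (made-up) estimates from repeated runs of one estimator.
runs = [0.70, 0.71, 0.69, 0.70, 0.71, 0.69, 0.70, 0.70]
print(tost_p_value(runs, theta_star=0.695, eps=0.02))   # generous tolerance
print(tost_p_value(runs, theta_star=0.695, eps=0.001))  # very strict tolerance
```

With the generous tolerance the small bias is accepted, while the strict tolerance leaves the equivalence claim unsupported. This is the sense in which $\varepsilon$ sets the practically acceptable error.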

Abstract

In the era of Model-as-a-Service, organizations increasingly rely on third-party AI models for rapid deployment. However, the dynamic nature of emerging AI applications, the continual introduction of new datasets, and the growing number of models claiming superior performance make efficient and reliable validation of model services increasingly challenging. This motivates the development of sample-efficient performance estimators, which aim to estimate model performance by strategically selecting instances for labeling, thereby reducing annotation cost. Yet existing evaluation approaches often fail in low-variance settings: RMSE conflates bias and variance, masking persistent bias when variance is small, while $p$-value-based tests become hypersensitive, rejecting adequate estimators for negligible deviations. To address this, we propose a fault-tolerant evaluation framework that integrates bias and variance considerations within an adjustable tolerance level $\varepsilon$, enabling the evaluation of performance estimators within practically acceptable error margins. We theoretically show that proper calibration of $\varepsilon$ ensures reliable evaluation across different variance regimes, and we further propose an algorithm that automatically optimizes and selects $\varepsilon$. Experiments on real-world datasets demonstrate that our framework provides comprehensive and actionable insights into estimator behavior.
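The two failure modes named in the abstract follow from standard identities, written out here as a reading aid. The symbols $\hat{\theta}$ (a single run's estimate), $\bar{\theta}$ and $s$ (sample mean and standard deviation over $N$ runs) are notation assumed for this note, not necessarily the paper's.

$$
\mathrm{RMSE}^2 = \mathbb{E}\big[(\hat{\theta} - \theta^*)^2\big] = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta^*\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}(\hat{\theta})}_{\text{variance}}, \qquad t = \frac{\bar{\theta} - \theta^*}{s / \sqrt{N}}
$$

Because RMSE reports only the sum, a biased estimator with small variance can score as well as an unbiased one with larger variance. In the $t$-statistic, any fixed nonzero bias is divided by a shrinking $s / \sqrt{N}$, so the two-sided $p$-value drops toward zero even when the deviation is negligible in practice.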

Paper Structure

This paper contains 28 sections, 13 equations, 5 figures, 3 tables, 2 algorithms.

Figures (5)

  • Figure 1: An overview of the evaluation challenge for sample-efficient model performance estimators. (a) AI models accessed via web APIs support various applications and users. (b) Performance estimators (e.g., Active Testing or Random Sampling) query and label task samples within a labeling budget to estimate model performance. (c) Estimator evaluation (our contribution): the full estimation process is repeated $N$ times to assess estimator quality against ground truth.
  • Figure 2: A comparison of two estimators, active testing (AT) and random sampling (RS), on 20 Newsgroups: (a) estimated performance (i.e., accuracy) with mean and standard deviation across multiple runs, against the ground-truth performance $\theta^* = 0.695$ (the red dashed line); (b) RMSE; and (c) $p$-values from the traditional two-sided $t$-test.
  • Figure 3: Comparison of Active Testing and Random Sampling estimators on MMLU-Pro (Physics): (a) Estimated Accuracy across labeling budgets, (b) Dynamic Tolerance $\varepsilon$ under different Discrimination Margins $\delta$, (c) FT-Eval, measured as $\max(p^{(L)}, p^{(U)})$.
  • Figure 4: Comparison of accuracy estimates for Active Testing (AT) and Random Sampling (RS) estimators on the MMLU-Pro (Math, Chemistry, Law, Engineering) datasets using three metrics: $p$-value, RMSE, and FT-Eval (measured as $\max(p^{(L)}, p^{(U)})$).
  • Figure 5: Comparison of accuracy estimates for Active Testing (AT) and Random Sampling (RS) estimators on the CIFAR-100, ImageNet, and DBpedia datasets using three metrics: $p$-value, RMSE, and FT-Eval (measured as $\max(p^{(L)}, p^{(U)})$).
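The repeated-run protocol described in Figure 1(c) can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the authors' pipeline: the pool of correctness labels is synthetic (chosen so the ground truth equals the $\theta^* = 0.695$ quoted in Figure 2), only a plain random-sampling estimator is shown, and the function names are hypothetical.

```python
import math
import random
from statistics import mean, pstdev

def random_sampling_estimate(correct, budget, rng):
    """Estimate accuracy by labeling `budget` uniformly sampled instances."""
    sample = rng.sample(correct, budget)
    return sum(sample) / budget

def evaluate_estimator(correct, budget, n_runs, seed=0):
    """Repeat the estimation n_runs times and summarize it against ground truth."""
    rng = random.Random(seed)
    theta_star = sum(correct) / len(correct)  # ground-truth accuracy of the full pool
    runs = [random_sampling_estimate(correct, budget, rng) for _ in range(n_runs)]
    bias = mean(runs) - theta_star
    spread = pstdev(runs)  # population standard deviation across runs
    rmse = math.sqrt(mean((r - theta_star) ** 2 for r in runs))
    return theta_star, bias, spread, rmse

# Synthetic pool (assumption): 1 = model prediction correct, 0 = incorrect.
pool = [1] * 695 + [0] * 305
theta_star, bias, spread, rmse = evaluate_estimator(pool, budget=50, n_runs=200)
print(f"theta*={theta_star:.3f} bias={bias:+.4f} std={spread:.4f} rmse={rmse:.4f}")
```

Reporting bias and spread separately, rather than RMSE alone, keeps the two error sources visible, which is the distinction the FT-Eval comparisons in the figures rely on.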