How to Choose a Threshold for an Evaluation Metric for Large Language Models

Bhaskarjit Sarmah, Mingshu Li, Jingrao Lyu, Sebastian Frank, Nathalia Castellanos, Stefano Pasquali, Dhagash Mehta

TL;DR

The paper tackles the problem of selecting robust thresholds for continuous LLM evaluation metrics by integrating stakeholder risk tolerance with model risk management practices. It proposes a stepwise, statistically rigorous methodology that proceeds from risk identification, through ground-truth data preparation, to cross-validated threshold determination using methods such as Z-scores, KDE, empirical recall, AUC-ROC, and conformal prediction. The authors demonstrate the approach on the Faithfulness metric with the HaluBench dataset, showing that each method has strengths and limitations depending on the metric and dataset, that conformal prediction provides reliable coverage, and that calibrated classifiers offer strong discriminative power. Overall, the framework is designed to guide practitioners in deploying LLMs more safely and reliably, and it can be extended to other GenAI applications and multi-model systems.

Abstract

To ensure that large language models (LLMs) behave reliably and to monitor them, various evaluation metrics have been proposed in the literature. However, there is little research prescribing a methodology for identifying a robust threshold on these metrics, even though an incorrect choice of threshold can have serious implications when LLMs are deployed. Translating the traditional model risk management (MRM) guidelines used within regulated industries such as the financial industry, we propose a step-by-step recipe for picking a threshold for a given LLM evaluation metric. We emphasize that such a methodology should start with identifying the risks of the LLM application under consideration and the risk tolerance of the stakeholders. We then propose concrete and statistically rigorous procedures to determine a threshold for the given LLM evaluation metric using available ground-truth data. As a concrete example to demonstrate the proposed methodology at work, we employ it on the Faithfulness metric, as implemented in various publicly available libraries, using the publicly available HaluBench dataset. We also lay a foundation for creating systematic approaches to select thresholds, not only for LLMs but for any GenAI application.
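
To make the idea of deriving a threshold from ground-truth data more tangible, the following is a minimal sketch of two of the method families named in the summary: a Z-score rule and an empirical-recall rule. It is not the authors' implementation. The function names, the label convention (1 = faithful, 0 = unfaithful), the "flag when the score is at or below the threshold" convention, and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def zscore_threshold(scores, labels, z=2.0):
    """Hypothetical Z-score rule: mean minus z standard deviations
    of the scores of faithful (label 1) examples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    if pos.size == 0:
        raise ValueError("need at least one faithful (label 1) example")
    # Population standard deviation (ddof=0) is an assumption of this sketch.
    return float(pos.mean() - z * pos.std())


def recall_threshold(scores, labels, target_recall=0.9):
    """Hypothetical empirical-recall rule: the smallest observed score t
    such that flagging `score <= t` catches at least `target_recall`
    of the unfaithful (label 0) examples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    neg = np.sort(scores[labels == 0])
    if neg.size == 0:
        raise ValueError("need at least one unfaithful (label 0) example")
    # Number of unfaithful examples that must fall at or below the threshold.
    k = max(int(np.ceil(target_recall * neg.size)), 1)
    return float(neg[k - 1])


if __name__ == "__main__":
    # Toy data, made up for illustration only (not from HaluBench).
    toy_scores = [0.1, 0.2, 0.3, 0.6, 0.5, 1.0]
    toy_labels = [0, 0, 0, 0, 1, 1]
    print("z-score threshold:", zscore_threshold(toy_scores, toy_labels, z=1.0))
    print("recall threshold:", recall_threshold(toy_scores, toy_labels, 0.75))
```

In this sketch the target recall plays the role of the stakeholders' risk tolerance: a higher target pushes the threshold up and flags more responses for review. In practice the threshold would be estimated on training folds and checked on held-out folds, in line with the cross-validated determination the summary describes.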

Paper Structure

This paper contains 34 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Conditional histograms for the faithfulness scores: (a) RAGAS; (b) DeepEval; (c) UpTrain.
  • Figure 2: Visualizations of model performance. (a) ROC curve, (b) Precision-recall curve.
  • Figure 3: Example visualizations of thresholds identified using local minimum. (a) UpTrain, (b) RAGAS, (c) DeepEval.
  • Figure 4: Example visualizations of thresholds identified using KDE. (a) UpTrain, (b) RAGAS, (c) DeepEval.
  • Figure 5: Visualizations of thresholds identified using empirical recall curve. (a) UpTrain, (b) RAGAS, (c) DeepEval.
  • ...and 3 more figures