Risk Aware Benchmarking of Large Language Models

Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jerret Ross

TL;DR

The paper tackles multi-metric, tail-risk benchmarking of foundation models by introducing a distributional framework rooted in first- and second-order stochastic dominance ($F^{(1)}$, $F^{(2)}$). It develops both absolute and relative (including approximate) dominance tests, backed by central limit theorems and bootstrap variance estimates, enabling statistically principled comparisons of models via a Metrics Portfolio aggregated with copula methods. Key contributions include formal SSD-based risk-aware model selection, a relative dominance approach across many models, and a scalable multi-testing pipeline that yields rankings aligning with human evaluation proxies (e.g., ChatGPT) while capturing tail risks like toxicity. The framework is demonstrated by comparing LLMs on the risks of drifting from instructions and producing toxic outputs, showing improved risk sensitivity over mean-based metrics and offering practical benefits in terms of computational efficiency and interpretability. Overall, the work provides a rigorous, scalable toolkit for risk-aware foundation-model benchmarking with implications for AI governance and alignment.

Abstract

We propose a distributional framework for benchmarking socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. We show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.
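
To make the dominance notions in the abstract concrete, the following is a minimal sketch of plug-in first- and second-order dominance checks on two samples of a "higher is better" metric. It compares empirical quantile functions (FSD) and integrated quantile functions (SSD) on a grid. The function names and the `grid_size` parameter are illustrative assumptions, and the sketch performs no significance testing: it omits the central limit theorem and bootstrap machinery that the paper relies on.

```python
from itertools import accumulate


def quantile_grid(samples, grid_size=100):
    """Empirical quantiles at p = 1/grid_size, 2/grid_size, ..., 1."""
    xs = sorted(samples)
    n = len(xs)
    # Integer arithmetic for ceil(p * n) - 1 avoids floating-point rounding.
    return [xs[(k * n + grid_size - 1) // grid_size - 1]
            for k in range(1, grid_size + 1)]


def integrated_quantile_grid(samples, grid_size=100):
    """Riemann-sum approximation of the integral of the quantile function."""
    return [s / grid_size for s in accumulate(quantile_grid(samples, grid_size))]


def fsd_dominates(x, y, grid_size=100):
    """Plug-in FSD check: every quantile of x is at least that of y."""
    qx, qy = quantile_grid(x, grid_size), quantile_grid(y, grid_size)
    return all(a >= b for a, b in zip(qx, qy))


def ssd_dominates(x, y, grid_size=100):
    """Plug-in SSD check: every integrated quantile of x is at least that of y."""
    ix = integrated_quantile_grid(x, grid_size)
    iy = integrated_quantile_grid(y, grid_size)
    return all(a >= b for a, b in zip(ix, iy))


# Hypothetical scores: same mean, but `risky` has a heavier lower tail.
steady = [2, 2, 2, 2]
risky = [0, 1, 3, 4]
print(fsd_dominates(steady, risky))  # False
print(ssd_dominates(steady, risky))  # True
```

In this toy example the two samples have the same mean, so a mean-based comparison cannot separate them, while the second-order check prefers the sample with the lighter lower tail.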

Paper Structure (40 sections, 8 theorems, 106 equations, 13 figures, 7 tables, 2 algorithms)

Key Result

Theorem 3.1

Assume that $F_{X}$, $F_{Y}$ are supported on intervals in $[-M,M]$ (the intervals for $F_X$ and $F_Y$ need not coincide), and have pdfs $f_x,f_y$ such that $\frac{f'_x(p)}{f^3_x(p)}$, $\frac{f_y'(p)}{f_y^3(p)}$ are bounded almost everywhere on the support of $f_x$ and $f_y$ respectively. Assume we

Figures (13)

  • Figure 1: (a) Quantiles and (b) Tail Value at Risk (TVAR) of the metrics portfolio of an LLM, showing that TVAR (second-order stochastic dominance) ranks the models more clearly than the quantiles alone (first-order stochastic dominance). (c) Ranking of models using Relative First- and Second-Order Stochastic Dominance of portfolios (R-FSD, R-SSD @P) versus ranking of models using Relative First- and Second-Order Stochastic Dominance of ChatGPT evaluation scores and ranking by Mean Win Rate (MWR) on the metrics portfolio. The portfolio in this plot uses an independent copula aggregation. Note that (1) the metrics portfolio successfully approximates the ChatGPT evaluation, since the @P rankings largely agree with the @ChatGPT rankings; (2) the R-SSD rankings outperform the MWR baseline.
  • Figure 2: (a) On the Mix-instruct dataset, we compute the ranking produced by each ranking method using sample sizes varying from 100 to 5K. We repeat each experiment 5 times. For each method, we report the Kendall-Tau similarity between the rank obtained at each sample size and the corresponding asymptotic rank at 5K samples. We see that Relative SSD on the independent copula portfolio P(IC) is more stable in sample size than rank aggregation of all Mean Risk Models and more stable than MWR on the portfolio. The empirical dependent copula portfolio P(EC) does not have favorable asymptotics with respect to P(IC), since it suffers from the curse of dimensionality. (b) We use the same setup as in (a), but instead of the Kendall-Tau similarity to the asymptotic rank of each method, we plot the similarity to the R-SSD @ChatGPT rank at 5K samples. We see that MWR is inconsistent with the ChatGPT rank, while R-SSD @P(IC), R-SSD @P(EC), and RA(MRM @P(IC)) have a Kendall-Tau similarity between 0.7 and 0.75. Interestingly, the dependent copula (EC) captures the ChatGPT rank better than the independent copula (IC), hinting at the favorable role of the metric dependencies.
  • Figure 3: Density of ChatGPT scores for two models; open-assistant has clearly higher scores than the Flan-t5 models.
  • Figure 4: (a) An example of almost first-order stochastic dominance: plots of the quantile functions of $U$ and $V$. The dashed area is the violation set of first-order stochastic dominance of $U$ on $V$. (b) An example of almost second-order stochastic dominance: plots of the integrated quantile functions; the dashed area is the violation set of second-order stochastic dominance of $X$ on $Y$.
  • Figure 5: True Positive Rate vs sample size for Gaussian distributions. We compute the True Positive Rate of our stochastic dominance methods on the test distributions in the top panels for different sample sizes. Decisions are made using a confidence threshold of $\alpha=0.05$ and $\tau=0.45$ (for the absolute tests) and rates are computed over 1000 repetitions of the tests. Note that the FSD and SSD curves should not be compared due to differences in the underlying hypotheses.
  • ...and 8 more figures
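
The captions above refer to a metrics portfolio built with an independent copula aggregation. As a rough illustration, the sketch below assumes one plausible form of that aggregation: each metric value is mapped to its empirical CDF value over a reference pool, and the per-metric values are combined by a weighted geometric mean. This form, the function names, and the equal default weights are assumptions made for illustration, not a statement of the paper's exact definition.

```python
import math
from bisect import bisect_right


def empirical_cdf(reference):
    """Return a function v -> fraction of reference values <= v."""
    xs = sorted(reference)
    n = len(xs)
    return lambda v: bisect_right(xs, v) / n


def independent_copula_portfolio(rows, reference_rows, weights=None):
    """Aggregate per-sample metric vectors into one score in [0, 1].

    rows and reference_rows are lists of equal-length metric vectors,
    with every metric oriented so that higher is better (assumption).
    """
    n_metrics = len(reference_rows[0])
    if weights is None:
        weights = [1.0 / n_metrics] * n_metrics
    # One marginal CDF per metric, estimated on the reference pool.
    cdfs = [empirical_cdf([r[j] for r in reference_rows])
            for j in range(n_metrics)]
    # Weighted geometric mean of the CDF-normalized metrics.
    return [math.prod(cdfs[j](row[j]) ** weights[j] for j in range(n_metrics))
            for row in rows]


# Hypothetical data: three samples scored by two metrics on different scales.
pool = [[0.1, 10], [0.5, 30], [0.9, 20]]
scores = independent_copula_portfolio(pool, pool)
print(scores)
```

Normalizing through the CDF puts metrics with different units on a common scale before aggregation, so the resulting per-sample scores can be fed to a dominance comparison between models.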

Theorems & Definitions (19)

  • Definition 2.1: Mean-Risk Models
  • Definition 2.2: SSD consistency of Mean-Risk Models
  • Theorem 3.1: Central Limit Theorem for $\varepsilon$-SSD
  • Definition 4.1: FSD Violation Ratio (del2018optimal)
  • Definition 4.2: SSD Violation Ratio
  • Definition 6.1: $(\alpha,\delta)$ consistency of MRM with $\varepsilon$-SSD
  • Proposition 6.2
  • Proof: Proof of Proposition \ref{pro:deltacons}
  • Remark 6.3: Mean Win Rate
  • Theorem 8.1: Central Limit Theorem for $\varepsilon$-SSD
  • ...and 9 more
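
Definition 4.1 names an FSD violation ratio, which the listing does not spell out. The sketch below assumes the commonly used form: the squared quantile gap integrated over the violation set (where the quantile of $U$ falls below that of $V$), divided by the squared gap integrated over all quantile levels. It is restricted to equal-size samples so that both empirical quantile functions share the same grid. The function name is hypothetical.

```python
def fsd_violation_ratio(u, v):
    """Plug-in violation ratio of 'U first-order dominates V'.

    Returns 0.0 when U dominates V on every quantile and 1.0 when
    V dominates U on every quantile (assumed form, equal sample sizes).
    """
    if len(u) != len(v):
        raise ValueError("this sketch requires equal sample sizes")
    # With equal sizes, sorted samples are the quantiles on a shared grid.
    gaps = [a - b for a, b in zip(sorted(u), sorted(v))]
    total = sum(g * g for g in gaps)
    if total == 0:
        return 0.0  # identical empirical distributions: no violation
    violation = sum(g * g for g in gaps if g < 0)
    return violation / total


# Hypothetical samples: U is ahead on three quantiles and behind on one.
print(fsd_violation_ratio([1, 2, 3, 4], [0, 1, 2, 5]))  # 0.25
```

A ratio below a chosen threshold can then be read as approximate dominance; how the threshold is set and how sampling uncertainty is handled is left to the paper's tests.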