Risk Aware Benchmarking of Large Language Models
Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jerret Ross
TL;DR
The paper addresses multi-metric, tail-risk-aware benchmarking of foundation models with a distributional framework built on first- and second-order stochastic dominance (FSD and SSD; $F^{(1)}$, $F^{(2)}$). It develops absolute and relative dominance tests, including approximate variants, supported by central limit theorems and bootstrap variance estimates, so that models can be compared with quantified statistical significance through a Metrics Portfolio that aggregates multiple metrics using copula methods. Key contributions are a formal SSD-based approach to risk-aware model selection, a relative dominance test for comparing many models at once, and a scalable multi-testing pipeline whose rankings align with proxies for human evaluation (e.g., ChatGPT) while capturing tail risks such as toxicity. The framework is demonstrated by comparing LLMs on the risks of drifting from instructions and producing toxic outputs, where it shows greater risk sensitivity than mean-based metrics along with practical benefits in computational efficiency and interpretability. Overall, the work provides a rigorous, scalable toolkit for risk-aware foundation-model benchmarking, with implications for AI governance and alignment.
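To make the Metrics Portfolio idea concrete, the sketch below shows one plausible way to aggregate several metrics into a single per-sample score. It is an illustration under stated assumptions, not the paper's implementation: every metric is assumed to be oriented so that higher is better, each metric is normalized by its empirical CDF pooled over all models (a copula-style rank transform), and the normalized values are combined with a weighted geometric mean. The function names `empirical_cdf_scores` and `metrics_portfolio`, the pooling choice, and the uniform default weights are all hypothetical.

```python
import numpy as np


def empirical_cdf_scores(values, pooled):
    """Map each value to its empirical CDF under the pooled sample, in (0, 1]."""
    pooled_sorted = np.sort(np.asarray(pooled, dtype=float))
    # side="right" counts how many pooled points are <= each value.
    ranks = np.searchsorted(pooled_sorted, np.asarray(values, dtype=float), side="right")
    return ranks / pooled_sorted.size


def metrics_portfolio(metrics_by_model, weights=None):
    """Aggregate per-sample metrics into one portfolio score per sample.

    metrics_by_model: dict mapping model name -> array (n_samples, n_metrics),
                      with every metric oriented so that higher is better
                      (an assumption of this sketch).
    weights:          optional array (n_metrics,) summing to 1; uniform if None.
    """
    names = list(metrics_by_model)
    arrays = [np.asarray(metrics_by_model[name], dtype=float) for name in names]
    n_metrics = arrays[0].shape[1]
    if weights is None:
        w = np.full(n_metrics, 1.0 / n_metrics)
    else:
        w = np.asarray(weights, dtype=float)

    # Pool all models so that every metric is normalized on a common scale.
    pooled = np.concatenate(arrays, axis=0)

    portfolios = {}
    for name, arr in zip(names, arrays):
        u = np.column_stack(
            [empirical_cdf_scores(arr[:, j], pooled[:, j]) for j in range(n_metrics)]
        )
        # Weighted geometric mean of the CDF-normalized metrics.
        portfolios[name] = np.exp(np.log(u) @ w)
    return portfolios
```

Because each sample is part of the pooled data, every normalized value is strictly positive and the logarithm is finite. The resulting one-dimensional portfolio samples can then be compared across models with a stochastic dominance test.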
Abstract
We propose a distributional framework for benchmarking socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative test based on first- and second-order stochastic dominance of real random variables. We show that the second-order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems, instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.
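As a rough illustration of second-order stochastic dominance, the sketch below uses the standard definition: with higher scores being better, $X$ dominates $Y$ in second order when $F^{(2)}_X(\eta) \le F^{(2)}_Y(\eta)$ for all $\eta$, where $F^{(2)}_X(\eta) = \int_{-\infty}^{\eta} F_X(x)\,dx = \mathbb{E}[(\eta - X)_+]$. The code evaluates the empirical version on the pooled sample points and uses a simple bootstrap to gauge variability. This is a simplified stand-in, not the paper's relative or approximate test statistic; the names `integrated_cdf`, `ssd_gap`, and `bootstrap_ssd`, the tolerance, and the resampling scheme are assumptions made for the example.

```python
import numpy as np


def integrated_cdf(sample, grid):
    """Empirical F^(2)(eta) = mean of (eta - x_i)_+ for each eta in grid."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    return np.maximum(grid[:, None] - sample[None, :], 0.0).mean(axis=1)


def ssd_gap(x, y):
    """Largest value of F2_X(eta) - F2_Y(eta) over the pooled sample points.

    Both empirical functions are piecewise linear with kinks at sample points,
    so checking the pooled points is enough. The gap is never negative (both
    functions vanish at the smallest point); a gap of zero means X empirically
    dominates Y in second order.
    """
    grid = np.union1d(x, y)
    return float(np.max(integrated_cdf(x, grid) - integrated_cdf(y, grid)))


def bootstrap_ssd(x, y, n_boot=500, seed=0, tol=1e-12):
    """Bootstrap the SSD gap; returns the gap, its bootstrap standard
    deviation, and the fraction of resamples in which X dominates Y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x, size=x.size, replace=True)
        yb = rng.choice(y, size=y.size, replace=True)
        gaps[b] = ssd_gap(xb, yb)
    return {
        "gap": ssd_gap(x, y),
        "boot_std": float(gaps.std(ddof=1)),
        "frac_dominant": float(np.mean(gaps <= tol)),
    }
```

A small case shows why the second-order criterion is risk-aware: scores `[2, 2, 2, 2]` and `[1, 3, 1, 3]` have the same mean, yet the first sample dominates the second because it has no low-score tail, which a comparison of means alone would not detect.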
