Brittlebench: Quantifying LLM robustness via prompt sensitivity

Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Okta, Sam Bell, Elia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, Adina Williams

Abstract

Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which can face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, that enables us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks and observe that model performance degrades by as much as 12%. However, these perturbations do not affect all models equally: even a single perturbation alters the relative ranking of models in 63% of cases, which affects conclusions about comparative model performance. Decomposing the total variance of both state-of-the-art open-weight and commercial models, we find that semantics-preserving input perturbations can account for up to half of the performance variance for a given model. Brittlebench highlights the need for more robust evaluations and models, and allows us to systematically understand model brittleness.
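
To make the idea of separating data-induced difficulty from prompt-related variability concrete, the following is a minimal sketch based on the generic law of total variance. It is not the paper's estimator: the function name `perturbation_variance_share`, the `scores[item][variant]` layout, and the assumption that every item is evaluated under the same number of prompt variants are all illustrative choices.

```python
from statistics import fmean, pvariance


def perturbation_variance_share(scores):
    """Split the variance of per-example scores into two parts.

    scores[i][k] is the score (e.g. 1.0 = correct, 0.0 = wrong) of one model
    on item i under prompt variant k. Hypothetical layout; every item is
    assumed to have the same number of variants, so the two parts add up
    exactly to the variance of all scores pooled together.
    """
    # Variance of per-item mean accuracy: how much items differ in difficulty.
    item_means = [fmean(row) for row in scores]
    between_items = pvariance(item_means)

    # Average variance across variants of the same item: prompt sensitivity.
    within_items = fmean([pvariance(row) for row in scores])

    total = between_items + within_items
    share = within_items / total if total > 0 else 0.0
    return between_items, within_items, share


if __name__ == "__main__":
    toy_scores = [
        [1.0, 1.0, 1.0, 1.0],  # easy item, stable under perturbation
        [1.0, 0.0, 1.0, 0.0],  # item whose outcome flips with the prompt
        [0.0, 0.0, 0.0, 0.0],  # hard item, stable under perturbation
    ]
    between, within, share = perturbation_variance_share(toy_scores)
    print(f"between-item variance: {between:.4f}")
    print(f"within-item variance:  {within:.4f}")
    print(f"share due to prompts:  {share:.2%}")
```

In this toy input only the second item changes its outcome across variants, so one third of the pooled variance is attributed to prompt variation and the remainder to differences in item difficulty.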

Paper Structure (22 sections, 7 equations, 6 figures, 6 tables)


Figures (6)

  • Figure 1: The Brittlebench meta-evaluation framework. We select widely used benchmarks and apply semantics-preserving perturbations as described in Appendix \ref{sec:perturbations}. We evaluate both the original and perturbed benchmarks using frontier and open-weight state-of-the-art models. Using performance measurements across benchmark–perturbation pairs, we measure the model variability by decomposing observed performance variance into components attributable to task difficulty and prompt sensitivity. Input perturbations systematically increase performance variance across models. Bar plots show benchmark accuracy distributions on original (left bars) and perturbed (right bars) inputs. While median performance remains comparable, perturbations consistently inflate dispersion, indicating reduced robustness to input format variation across all model scales. Evaluating models on Brittlebench's perturbations provides a more comprehensive assessment by accounting for biases introduced by variations in input formats.
  • Figure 2: (a): Model sensitivity to perturbation intensity. Llama3.1-8B (top) and Qwen3-8B (bottom) accuracy (%) aggregated across benchmarks, under increasing numbers of surface-form perturbations, including typos and prompt-padding variations. Each line shows accuracy as a function of the number of perturbation occurrences, illustrating how perturbation intensity affects robustness. (b): Variance decomposition for models (top) and benchmarks (bottom). From the model's perspective, the variance induced by perturbations ($\Pi_m$) quantifies whether a model’s performance variability across tasks is dominated by prompt sensitivity rather than intrinsic item difficulty. From the benchmark's perspective, the variance induced by perturbations ($\Pi_b$) indicates whether a benchmark primarily discriminates models based on task difficulty or on sensitivity to input perturbations.
  • Figure 3: Heatmap of accuracy drop (in %; $\downarrow$) from the baseline on MMLU for Qwen3-8B under single and paired input perturbations of Brittlebench. The first row reports the performance drop for individual perturbations, while each subsequent row shows the additional effect of applying a second perturbation on top of the first. Paraphrasing perturbations are always applied once. Higher values indicate larger degradation, revealing both compounding and counteracting effects between perturbations. Perturbations are sorted by their average performance drop across all benchmarks and models, from highest (left/top) to lowest (right/bottom).
  • Figure 4: Visual explanation of perturbations used in Brittlebench. Perturbations are grouped based on their type as described in Section \ref{sec:perturbations}.
  • Figure 5: Comparison of chain-of-thought (CoT) and standard prompting for Claude 4.5 across benchmarks and input perturbations. Results are computed on the same evaluation examples for both prompting methods. CoT significantly improves accuracy and robustness on reasoning-heavy tasks (GPQA, LOGIQA, MATHQA, MMLU), with minimal gains on ARC and TRUTHFULQA. Significance markers denote McNemar’s test (* p < 0.05, ** p < 0.01, *** p < 0.001).
  • ...and 1 more figure
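
The Figure 5 caption reports McNemar's test on paired results, that is, two prompting methods scored on the same evaluation examples. The sketch below shows one common form of that test, the exact two-sided binomial version computed from the discordant pairs. The caption does not say which variant was used (exact or chi-squared), so treat this as an illustration only; the function name `mcnemar_exact` and the boolean-list inputs are assumptions.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test for two methods scored on the same items.

    correct_a[i] and correct_b[i] say whether method A and method B answered
    item i correctly. Only discordant pairs (exactly one method correct)
    carry information about a difference between the methods.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same examples")

    pairs = list(zip(correct_a, correct_b))
    only_a = sum(1 for a, b in pairs if a and not b)
    only_b = sum(1 for a, b in pairs if b and not a)

    n = only_a + only_b
    if n == 0:
        # No discordant pairs: no evidence of any difference.
        return only_a, only_b, 1.0

    # Under the null hypothesis each discordant pair favours A or B with
    # probability 1/2, so the smaller count follows Binomial(n, 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Toy data: 13 shared examples, invented purely for illustration.
    cot = [True] * 13
    standard = [False] * 8 + [True] * 5
    only_cot, only_std, p = mcnemar_exact(cot, standard)
    print(only_cot, only_std, p)
```

Because the test ignores examples on which both methods agree, it only applies when both methods are run on identical examples, which is the condition the caption states.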