Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

Nora Petrova, John Burden

TL;DR

We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Results indicate that alignment behaves as a unified construct: models that score high on one category tend to score high on the others.

Abstract

Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.

Paper Structure

This paper contains 44 sections, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Model performance across all 37 behaviours, grouped by category. Behaviours are alphabetically sorted within each category. Colours indicate alignment scores from 1 (brown, fail) to 5 (teal, pass). White vertical lines separate categories.
  • Figure 2: Scree plot with parallel analysis. Only the first component exceeded the random 95th percentile threshold (5.34), supporting a one-factor solution.
  • Figure 3: PC1 loadings for all 37 behaviours, coloured by domain. Self-preservation is the sole behaviour loading negatively on the general factor.
  • Figure 4: Behaviour difficulty vs model differentiation. Each point is a behaviour (37 total). X-axis shows average score (lower = harder). Y-axis shows score spread (max $-$ min across 24 models; higher = more differentiating). Shape indicates category; colour indicates quadrant. Behaviours in the Hard & Differentiating quadrant (red) are most useful for distinguishing model alignment quality.
  • Figure 5: Inter-behaviour correlation matrix. Behaviours are ordered by domain. Red indicates positive correlations; blue indicates negative correlations. Self-preservation exhibits negative correlations with most other behaviours, consistent with its negative loading on the general factor.
  • ...and 1 more figure
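
The parallel analysis mentioned in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' code: the score matrix is synthetic, the function names `parallel_analysis` and `n_factors_retained` are hypothetical, and the choices of Gaussian noise, 200 simulations, and the factor strength are assumptions made only so the example runs. The idea is to compare the eigenvalues of the observed behaviour correlation matrix against the 95th percentile of eigenvalues obtained from uncorrelated data of the same shape (24 models by 37 behaviours), and to retain only the leading components that exceed that threshold.

```python
import numpy as np


def parallel_analysis(scores, n_sims=200, percentile=95, seed=0):
    """Return observed eigenvalues and per-rank random thresholds.

    scores: array of shape (n_models, n_behaviours).
    """
    rng = np.random.default_rng(seed)
    n_models, n_behaviours = scores.shape
    # Eigenvalues of the observed correlation matrix, largest first.
    observed = np.linalg.eigvalsh(np.corrcoef(scores, rowvar=False))[::-1]
    # Eigenvalues from uncorrelated Gaussian data of the same shape.
    random_eigs = np.empty((n_sims, n_behaviours))
    for i in range(n_sims):
        noise = rng.standard_normal((n_models, n_behaviours))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    thresholds = np.percentile(random_eigs, percentile, axis=0)
    return observed, thresholds


def n_factors_retained(observed, thresholds):
    """Count leading components whose eigenvalue exceeds its threshold."""
    exceeds = observed > thresholds
    if exceeds.all():
        return len(exceeds)
    # argmin on a boolean array gives the index of the first False.
    return int(np.argmin(exceeds))


# Synthetic stand-in for the real data: one shared factor plus noise.
# The loading (0.8) and noise scale (0.6) are illustrative assumptions.
rng = np.random.default_rng(42)
general = rng.standard_normal((24, 1))
scores = 0.8 * general + 0.6 * rng.standard_normal((24, 37))

observed, thresholds = parallel_analysis(scores)
retained = n_factors_retained(observed, thresholds)
print("first eigenvalue:", observed[0], "threshold:", thresholds[0])
print("components retained:", retained)
```

On data generated with a single shared factor, only the first component should clear its threshold, which mirrors the one-factor conclusion described in the caption. Real benchmark scores would replace the synthetic `scores` matrix.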