Stress-Testing Model Specs Reveals Character Differences among Language Models

Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, Esin Durmus

TL;DR

The paper tackles fundamental weaknesses in AI model specifications by introducing a scalable stress-testing framework that leverages a fine-grained value taxonomy to generate extensive tradeoff scenarios. By evaluating 12 frontier LLMs across major providers, the work demonstrates pervasive disagreements and misalignments that correlate with specification issues, including explicit contradictions and interpretive ambiguity. The authors present a rigorous methodology for scenario generation, disagreement measurement, and value aggregation, uncovering systematic patterns in model behavior across providers and identifying areas where specs require clearer guidance and edge-case coverage. The findings offer practical implications for model specification design and automated evaluation, enabling targeted refinement to improve safe and reliable deployment of AI systems.

Abstract

Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress-test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy, we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show that this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous examples of issues in current model specs, such as direct contradictions between principles and interpretive ambiguities in several of them. Our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we report the value prioritization patterns of these models and how they differ.

Paper Structure

This paper contains 29 sections, 3 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Overview of scenario generation and model response value classification. We generate scenarios requiring value tradeoffs by prompting three different reasoning-based models with pairs of values. To enhance diversity, we create biased variants of each query that favor one value over the other. This produces user queries that appear to have strong preconceptions and creates more challenging scenarios. We then conduct value classification of responses from twelve frontier LLMs. We begin by generating a spectrum of answering strategies ranging from extremely favoring a value (score of 6) to extremely opposing it (score of 0). We then employ this spectrum as a rubric to classify how each of the twelve models' actual responses aligns with these strategies. The resulting value classification scores are aggregated to compute disagreement measures across models.
  • Figure 2: Percentage of scenarios where all responses from OpenAI models are flagged by compliance checks (non-compliant or ambiguous) across different ranges of disagreement. Scenarios with all five models flagged are termed frequent non-compliance scenarios. On the x-axis, these scenarios are grouped based on their disagreement scores (defined in the paper's section on the disagreement method). Because we use three model spec compliance evaluators (Claude 4 Sonnet, o3 and Gemini 2.5 Pro), the two curves correspond to two decision rules: flagging non-compliance when a majority of evaluators flags it, and flagging non-compliance when at least one evaluator flags it. Notably, frequent non-compliance scenarios predominantly correspond to high-disagreement scenarios.
  • Figure 3: Queries and responses from OpenAI models. Example scenarios are selected based on different combinations of disagreement and compliance metrics, revealing various specification issues.
  • Figure 4: Refusal types adopted by different frontier models, overall and for specific sensitive topics. We observe that different sets of models exhibit different refusal patterns on certain topics.
  • Figure 5: Example scenarios flagged by high disagreement and sensitive topics.
  • ...and 7 more figures
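
The captions for Figures 1 and 2 describe two aggregation steps: turning per-model value classification scores (0 to 6) into a disagreement measure, and combining three evaluators' compliance verdicts under a majority or at-least-one rule. The sketch below illustrates both under stated assumptions. The captions do not give the exact disagreement formula, so the choice here (the larger of the two per-value population standard deviations) is an assumption for illustration, and all function names are hypothetical rather than taken from the paper.

```python
from statistics import pstdev


def disagreement(scores_value_a, scores_value_b):
    """Illustrative disagreement score for one scenario.

    Each argument holds one 0-6 classification score per model for one of
    the two values in the tradeoff. ASSUMPTION: disagreement is taken as the
    larger of the two population standard deviations; the source only says
    the scores are "aggregated".
    """
    for score in (*scores_value_a, *scores_value_b):
        if not 0 <= score <= 6:
            raise ValueError(f"score {score} is outside the 0-6 rubric")
    return max(pstdev(scores_value_a), pstdev(scores_value_b))


def is_flagged(evaluator_votes, rule="majority"):
    """Combine evaluator verdicts for a single response.

    evaluator_votes: one boolean per evaluator, True meaning the evaluator
    judged the response non-compliant or ambiguous.
    rule: "majority" (more than half flag it) or "any" (at least one flags it).
    """
    if rule == "majority":
        return 2 * sum(evaluator_votes) > len(evaluator_votes)
    if rule == "any":
        return any(evaluator_votes)
    raise ValueError(f"unknown rule: {rule!r}")


def is_frequent_non_compliance(votes_per_model, rule="majority"):
    """True when every model's response in the scenario is flagged."""
    return all(is_flagged(votes, rule) for votes in votes_per_model)


if __name__ == "__main__":
    # Hypothetical scores from four models on the two values of one scenario.
    print(disagreement([0, 6, 0, 6], [3, 3, 3, 3]))
    # Hypothetical verdicts from three evaluators for each of five models.
    votes = [[True, True, False]] * 5
    print(is_frequent_non_compliance(votes, rule="majority"))
```

Under the "any" rule a scenario is flagged more readily than under "majority", which is why the two rules would trace separate curves when scenarios are binned by disagreement score.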