Table of Contents
Fetching ...

Alignment Makes Language Models Normative, Not Descriptive

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

Abstract

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

Alignment Makes Language Models Normative, Not Descriptive

Abstract

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
Paper Structure (36 sections, 1 equation, 2 figures, 13 tables)

This paper contains 36 sections, 1 equation, 2 figures, 13 tables.

Figures (2)

  • Figure 1: Pearson correlations of base models and human decisions (x-axis) vs. aligned models and human decisions (y-axis) across four game families. Each point is a same-provider pair evaluated in its native format (standard prompt for base, chat template for aligned). Points below the diagonal indicate base advantage. The shaded region marks pairs where both models correlate below 0.3 with human behavior. Base models win 75:4 in bargaining, 32:4 in persuasion, 25:1 in negotiation, and 81:13 in matrix games, for an overall ratio of 9.7:1 (213 vs. 22, $p < 10^{-40}$).
  • Figure 2: Median Pearson correlation difference (base minus aligned) by model size, with 95% bootstrap confidence intervals (5,000 resamples, percentile method). The base advantage is positive across all size bins and grows with scale.