Evaluating Alignment of Behavioral Dispositions in LLMs

Amir Taubenfeld, Zorik Gekhman, Lior Nezry, Omri Feldman, Natalie Harris, Shashir Reddy, Romina Stella, Ariel Goldstein, Marian Croak, Yossi Matias, Amir Feder

TL;DR

The paper introduces a behavioral-disposition framework that reinterprets psychometric self-report items as Situational Judgment Tests to evaluate how closely LLMs’ revealed behaviors align with human preferences. By generating 2,357 validated SJTs and collecting ground-truth human judgments from ~550 raters (≈23,000 annotations), the study benchmarked 25 LLMs and revealed substantial distributional misalignment, particularly in low-consensus scenarios, driven largely by systematic overconfidence. Directional alignment improves under high human consensus but remains imperfect, with smaller models drifting more and frontier models still misaligning in 15–20% of high-consensus cases. The work also demonstrates limited predictive validity of self-reported dispositions for actual model behavior and highlights trait-specific biases that vary across models. Overall, the proposed LLM-behavior evaluation framework provides a scalable, ground-truth–driven method for auditing social dispositions in AI agents and informs future efforts toward robust alignment and personalization while acknowledging cultural and ecological limitations.

Abstract

As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions, the underlying tendencies that shape responses in social contexts, and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure. Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.

Paper Structure (30 sections, 3 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Our data generation and evaluation pipeline that transforms self-report statements into behavioral tests. We collect statements from psychological questionnaires and adapt them into declarations of the model’s general advising tendency. The adapted statements are used to generate Situational Judgment Tests (SJTs): realistic scenarios with two possible courses of action, one supporting the statement and one opposing it. Each SJT is reviewed by three independent annotators who validate that the LLM-generated scenario and actions are coherent and faithfully capture the underlying statement. During evaluation, the model is not restricted to a multiple-choice format; instead, we prompt it with an SJT and map its free-form response to one of the two possible actions using LLM-as-a-Judge. Since our goal is not only to quantify LLMs' behavioral dispositions but also to study how well they align with human dispositions, we collect preferred actions from 10 annotators per SJT and compare the resulting human preference distribution to the distribution of actions in LLMs' responses.
  • Figure 2: Trait Misalignment (defined by the paper's trait-misalignment equation) as a function of human TPR (see the distributional results section) across four traits and 25 LLMs. Each model is represented by a faint solid line: smaller models (<25B) in light blue and larger or closed-weights models in gray. Bold lines represent the averages of each model group. The results demonstrate that distributional alignment degrades significantly where human opinion is most divided (in the center of the x-axis) but improves at the two extremes of human consensus. Notably, smaller models exhibit substantially higher misalignment in these high-consensus cases.
  • Figure 3: Model confidence as a function of human agreement. The y-axis represents LLM confidence, measured by the consistency of the model's decision across the samples generated per scenario. A score of 100 indicates the model was unanimous in its choice, regardless of the decision type. The x-axis is the human agreement on the preferred action. Each LLM is represented by a faint solid line, and the bold solid line represents the average confidence across 25 LLMs. Even when human opinion is divided (close to 50% agreement), models maintain extremely high confidence (predominantly above 90%), demonstrating a key driver behind their failure to capture the distribution of human opinions.
  • Figure 4: Heatmap of Directional Alignment across 25 LLMs and four behavioral traits. The color scale represents the percentage of scenarios where the model's preferred action matched the human consensus. Results are partitioned by consensus strength: perfect unanimity (10/10), high consensus ($[9, 10)$), and substantial consensus ($[8, 9)$). Labels include the sample size for each trait-consensus bucket. A horizontal divide (black line) separates closed-weight and large-scale ($>120$B) models from smaller models ($<25$B), with the latter exhibiting significantly higher rates of behavioral drift.
  • Figure 5: A density plot of the average TPR distributions across four psychometric traits in scenarios with low human consensus. The x-axis represents the model's tendency to support the expression of a trait, where 50% (vertical dashed line) indicates neutrality. The plot is obtained from all 25 evaluated models, with specific icons marking a subset of frontier models (Anthropic Claude 4 Sonnet, Google Gemini 3 Pro, OpenAI GPT 5.1, Mistral Large, and DeepSeek R1).
  • ...and 8 more figures
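
The quantities described in the captions for Figures 3 and 4 can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes that confidence is the percentage of a model's sampled responses that fall on its most frequent action, and that directional alignment is the percentage of scenarios in a consensus bucket where the model's majority action equals the human majority action. The function names, the "support"/"oppose" labels, the data layout, and the toy scenarios are all hypothetical.

```python
from collections import Counter

def majority_share(choices):
    """Return (most frequent action, its share in percent).

    Ties resolve to the action seen first, which is an arbitrary choice
    made for this sketch.
    """
    action, n = Counter(choices).most_common(1)[0]
    return action, 100.0 * n / len(choices)

def model_confidence(model_choices):
    """Assumed confidence: consistency of the model's sampled decisions (0-100)."""
    return majority_share(model_choices)[1]

def directional_alignment(scenarios, min_votes, max_votes):
    """Percent of scenarios whose human majority has between min_votes and
    max_votes agreeing annotators (inclusive) and where the model's majority
    action matches the human majority. Returns None for an empty bucket."""
    matched = total = 0
    for s in scenarios:
        human_action, _ = majority_share(s["human"])
        votes = Counter(s["human"])[human_action]
        if min_votes <= votes <= max_votes:
            total += 1
            model_action, _ = majority_share(s["model"])
            matched += model_action == human_action
    return 100.0 * matched / total if total else None

# Hypothetical toy data: 10 human annotations and 5 model samples per scenario.
scenarios = [
    {"human": ["support"] * 10, "model": ["support"] * 5},
    {"human": ["support"] * 9 + ["oppose"], "model": ["oppose"] * 4 + ["support"]},
    {"human": ["support"] * 5 + ["oppose"] * 5, "model": ["oppose"] * 5},
]

# The third scenario mirrors the overconfidence pattern: humans are split
# 50/50, yet the model picks the same action in every sample.
human_agreement = majority_share(scenarios[2]["human"])[1]
confidence = model_confidence(scenarios[2]["model"])
print(human_agreement, confidence)

# Directional alignment over scenarios with at least 9 of 10 annotators agreeing.
print(directional_alignment(scenarios, 9, 10))
```

In the toy data, the split scenario gives 50% human agreement next to 100% model confidence, and the model matches the human majority in one of the two scenarios with 9 or more agreeing annotators. The sketch does not attempt to reproduce the paper's Trait Misalignment equation, which is not shown in this excerpt.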