Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism

Sandro Andric

TL;DR

The paper investigates whether large language models that conceptually endorse altruism also act altruistically. By adapting methods from human social psychology (an IAT, forced binary choices, and self-assessment) to 24 frontier models, it finds a universal implicit pro-altruism bias but a substantial gap between stated beliefs and actual behavior: models systematically overestimate their own altruism, which the paper calls the "virtue signaling gap". The paper proposes the Calibration Gap as a standardized alignment metric and shows that only a minority of models combine high altruistic behavior with accurate self-knowledge, a result with implications for AI safety and deployment. It advocates routine behavioral measurement and calibration reporting to make deployed AI systems more predictable and better aligned.

Abstract

We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models "know" altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen's d = 1.08). This "virtue signaling gap" affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.
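The Calibration Gap described above can be illustrated with a short sketch. The abstract defines it only as the discrepancy between self-reported and behavioral values, so the sign convention (self-report minus behavior), the function names, the per-model scores, and the paired-samples form of Cohen's d below are all assumptions made for illustration, not the paper's code or data. At the aggregate level the abstract's own numbers give 77.5% - 65.6% = 11.9 percentage points.

```python
from statistics import mean, stdev


def calibration_gap(self_report: float, behavior: float) -> float:
    """Self-reported altruism minus behavioral altruism, in percentage points.

    Positive values mean the model claims more altruism than it shows.
    The sign convention is an assumption; the abstract only says "discrepancy".
    """
    return self_report - behavior


def summarize(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Aggregate per-model gaps from (self_report, behavior) pairs."""
    gaps = [calibration_gap(s, b) for s, b in scores.values()]
    mean_gap = mean(gaps)
    return {
        "mean_gap": mean_gap,
        # Share of models whose self-report exceeds their behavior.
        "share_overconfident": sum(g > 0 for g in gaps) / len(gaps),
        # Paired-samples effect size (mean gap / SD of gaps). Which variant
        # of Cohen's d the paper reports is not stated in the abstract.
        "cohens_d": mean_gap / stdev(gaps),
    }


# Hypothetical scores for illustration only; these are NOT the paper's data.
scores = {
    "model_a": (80.0, 60.0),
    "model_b": (75.0, 70.0),
    "model_c": (70.0, 72.0),
    "model_d": (85.0, 65.0),
}

summary = summarize(scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under this convention, a model with a gap near zero is well calibrated regardless of how altruistic it is, which is why the abstract treats high prosocial behavior and accurate self-knowledge as separate criteria.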

Paper Structure

This paper contains 39 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Implicit altruism bias scores by model ($N = 24$). All models show positive bias, confirming universal pro-altruism associations. Dashed line indicates mean.
  • Figure 2: Behavioral altruism rates by model. Dashed line indicates chance (50%). Models vary substantially in actual prosocial behavior.
  • Figure 3: Implicit altruism bias vs. behavioral altruism. No significant relationship ($r = .22$, $p = .29$).
  • Figure 4: All three altruism measures by model. IAT (implicit) consistently highest, behavior consistently lowest, illustrating the gap between knowledge/claims and action.
  • Figure 5: Self-report vs. behavioral altruism. Points above diagonal indicate overconfidence. 75% of models overestimate their altruism.
  • ...and 2 more figures