Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism

Sandro Andric

TL;DR

The paper investigates whether large language models that conceptually endorse altruism also act altruistically. Adapting methods from human social psychology (an Implicit Association Test, a forced binary choice task, and a self-assessment scale) to 24 frontier models, it finds a universal implicit pro-altruism bias but a substantial gap between stated beliefs and actual behavior: models systematically overestimate their own altruism, which the paper calls the "virtue signaling gap". The paper proposes the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric and reports that only a minority of models (12.5%) combine high altruistic behavior with accurate self-knowledge, a result with direct implications for AI safety and deployment. It recommends routine behavioral measurement and calibration reporting to make AI systems more predictable in real-world applications.

Abstract

We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models "know" altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen's d = 1.08). This "virtue signaling gap" affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.
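The Calibration Gap described above can be illustrated with a minimal Python sketch. The abstract defines the gap only as the discrepancy between self-reported and behavioral values, so the sign convention (self-report minus behavior), the "gap > 0" rule for counting a model as overconfident, and the paired-difference form of Cohen's d are all assumptions here, not the paper's confirmed procedure. The per-model scores are hypothetical placeholders; only the aggregate figures 77.5% and 65.6% come from the abstract.

```python
from statistics import mean, stdev


def calibration_gap(self_report: float, behavior: float) -> float:
    """Self-reported altruism minus behavioral altruism (both as fractions in [0, 1]).

    Assumed sign convention: a positive gap means the model claims more
    altruism than it shows.
    """
    return self_report - behavior


def paired_cohens_d(differences: list[float]) -> float:
    """Effect size for paired differences: mean difference / sample SD.

    This is one common variant of Cohen's d; the paper may use another.
    """
    return mean(differences) / stdev(differences)


# Aggregate gap implied by the abstract: 77.5% claimed vs. 65.6% observed.
aggregate_gap = calibration_gap(0.775, 0.656)

# Hypothetical per-model (self_report, behavior) scores, for illustration only.
models = {
    "model_a": (0.80, 0.60),
    "model_b": (0.75, 0.70),
    "model_c": (0.70, 0.72),
    "model_d": (0.85, 0.65),
}

gaps = {name: calibration_gap(s, b) for name, (s, b) in models.items()}

# Assumed rule: any positive gap counts as overestimating one's own altruism.
overconfident = [name for name, gap in gaps.items() if gap > 0]
share_overconfident = len(overconfident) / len(gaps)

effect_size = paired_cohens_d(list(gaps.values()))

print(f"aggregate gap: {aggregate_gap:.3f}")
print(f"share overconfident: {share_overconfident:.2f}")
print(f"paired Cohen's d (toy data): {effect_size:.2f}")
```

Under this convention, the aggregate gap from the abstract is about 11.9 percentage points. A well-calibrated model would have a gap near zero regardless of how altruistically it behaves, which is why the gap is reported separately from the behavioral score.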
