Anchors in the Machine: Behavioral and Attributional Evidence of Anchoring Bias in LLMs
Felipe Valencia-Clavijo
TL;DR
This paper investigates whether anchoring bias in LLMs reflects surface imitation or genuine probability shifts by combining log-probability-based behavioral analyses with exact Shapley attribution over structured prompt fields. It analyzes six open-source models across controlled anchor regimes and introduces the Anchoring Bias Sensitivity Score (ABSS) to integrate behavioral and attributional evidence, reporting $\text{SoftEV}$ shifts and $\Delta\text{Shapley}$ contributions. Key findings show robust B+ anchoring in Gemma-2B, Phi-2, and Llama-2-7B, with frequent A+ attribution signals, while smaller models display mixed or discordant patterns, suggesting that scale modulates anchoring sensitivity. The ABSS framework provides a reproducible methodology bridging behavioral science, safety, and interpretability in LLMs, with implications for risk-aware deployment and future bias benchmarking across architectures and training paradigms.
Abstract
Large language models (LLMs) are increasingly examined as both behavioral subjects and decision systems, yet it remains unclear whether observed cognitive biases reflect surface imitation or deeper probability shifts. Anchoring bias, a classic human judgment bias, offers a critical test case. While prior work shows that LLMs exhibit anchoring, most evidence relies on surface-level outputs, leaving internal mechanisms and attributional contributions unexplored. This paper advances the study of anchoring in LLMs through three contributions: (1) a log-probability-based behavioral analysis showing that anchors shift entire output distributions, with controls for training-data contamination; (2) exact Shapley-value attribution over structured prompt fields to quantify anchor influence on model log-probabilities; and (3) a unified Anchoring Bias Sensitivity Score (ABSS) integrating behavioral and attributional evidence across six open-source models. Results reveal robust anchoring effects in Gemma-2B, Phi-2, and Llama-2-7B, with attribution indicating that the anchors influence probability reweighting. Smaller models such as GPT-2, Falcon-RW-1B, and GPT-Neo-125M show variability, suggesting that scale may modulate sensitivity. Attributional effects, however, vary across prompt designs, underscoring the fragility of treating LLMs as human substitutes. The findings demonstrate that anchoring bias in LLMs is robust, measurable, and interpretable, while highlighting risks in applied domains. More broadly, the framework bridges behavioral science, LLM safety, and interpretability, offering a reproducible path for evaluating other cognitive biases in LLMs.
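To make the second contribution concrete, the following is a minimal sketch of exact Shapley-value attribution over a small set of prompt fields. It is an illustration under stated assumptions, not the paper's implementation: the field names (`instruction`, `anchor`, `question`), the function `exact_shapley`, and the toy value function `toy_log_prob` are all hypothetical. In a real setting the value function would query a model for the log-probability of a target answer given only the included fields; here it is a hand-written stand-in so the example runs without a model.

```python
from itertools import combinations
from math import factorial

# Hypothetical prompt fields; the paper's actual field structure may differ.
FIELDS = ["instruction", "anchor", "question"]


def toy_log_prob(included):
    """Stand-in value function (an assumption, not a real model call).

    Returns a made-up log-probability of a target answer when only the
    fields in `included` are present in the prompt.
    """
    value = -5.0
    if "anchor" in included:
        value += 2.0
    if "question" in included:
        value += 1.0
    if "anchor" in included and "question" in included:
        value += 0.5  # interaction between the anchor and the question
    return value


def exact_shapley(fields, value_fn):
    """Compute exact Shapley values by enumerating every coalition.

    The cost grows as 2^n in the number of fields, which is only
    practical when the prompt is split into a handful of fields.
    """
    n = len(fields)
    phi = {f: 0.0 for f in fields}
    for f in fields:
        others = [g for g in fields if g != f]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                marginal = value_fn(coalition | {f}) - value_fn(coalition)
                phi[f] += weight * marginal
    return phi


if __name__ == "__main__":
    for field, contribution in exact_shapley(FIELDS, toy_log_prob).items():
        print(f"{field}: {contribution:+.3f}")
```

In this toy setup the attributions sum to the difference between the value of the full prompt and the value of the empty prompt (the efficiency property of Shapley values), and the interaction term is split evenly between the two fields that produce it. Exact enumeration is feasible here only because the number of fields is small.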
