Dissecting the Ledger: Locating and Suppressing "Liar Circuits" in Financial Large Language Models
Soham Mirajkar
TL;DR
Financial LLMs hallucinate arithmetic results during numerical reasoning; this paper addresses the problem with a mechanistic framework based on causal tracing. The authors identify a dual-stage circuit comprising distributed middle-layer computation and a late-layer aggregation bottleneck, pinpointing Layer 46 as the critical gatekeeper. Ablating Layer 46 reduces the model's confidence in hallucinated outputs by $81.8\%$, and a linear probe trained on Layer 46 activations generalizes to unseen financial topics with $98\%$ accuracy, suggesting a universal geometry of arithmetic deception. These findings enable lightweight, topic-agnostic safety monitors that read the model's internal state at aggregation layers.
Abstract
Large Language Models (LLMs) are increasingly deployed in high-stakes financial domains, yet they suffer from specific, reproducible hallucinations when performing arithmetic operations. Current mitigation strategies often treat the model as a black box. In this work, we propose a mechanistic approach to intrinsic hallucination detection. By applying Causal Tracing to the GPT-2 XL architecture on the ConvFinQA benchmark, we identify a dual-stage mechanism for arithmetic reasoning: a distributed computational scratchpad in middle layers (L12-L30) and a decisive aggregation circuit in late layers (specifically Layer 46). We verify this mechanism via an ablation study, showing that suppressing Layer 46 reduces the model's confidence in hallucinated outputs by 81.8%. Furthermore, a linear probe trained on this layer's activations generalizes to unseen financial topics with 98% accuracy, suggesting a universal geometry of arithmetic deception.
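To make the ablation measurement concrete, the following is a minimal toy sketch, not the authors' pipeline. It assumes that each layer adds a contribution to the output logits, that "suppressing" a layer means zeroing its contribution, and that the reported reduction is a relative drop in the probability of the hallucinated token. The function names, the three-layer toy model, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(layer_contribs, target, ablate=None):
    """Probability of token `target` when per-layer logit contributions
    are summed. If `ablate` is a layer index, that layer's contribution
    is skipped (zero-ablation). This is a toy stand-in for a real model."""
    n_tokens = len(layer_contribs[0])
    logits = [0.0] * n_tokens
    for i, contrib in enumerate(layer_contribs):
        if i == ablate:
            continue  # zero-ablate this layer
        for t in range(n_tokens):
            logits[t] += contrib[t]
    return softmax(logits)[target]


def relative_drop(p_base, p_ablated):
    """Relative reduction in confidence (assumed definition of the metric)."""
    return (p_base - p_ablated) / p_base


# Hypothetical toy model: 3 "layers", 2 candidate tokens
# (index 0 = hallucinated answer, index 1 = correct answer).
toy_layers = [
    [0.1, 0.0],  # early layer: little effect
    [0.2, 0.1],  # middle layer: little effect
    [2.0, 0.0],  # late "aggregation" layer: pushes the hallucinated token
]

p_base = confidence(toy_layers, target=0)
p_ablated = confidence(toy_layers, target=0, ablate=2)
print(f"baseline confidence: {p_base:.3f}")
print(f"ablated confidence:  {p_ablated:.3f}")
print(f"relative drop:       {relative_drop(p_base, p_ablated):.1%}")
```

In a real experiment the per-layer contributions would come from the model itself, for example by intervening on one layer's output during a forward pass, rather than from a hand-written table.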
