When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
Shutong Fan, Lan Zhang, Xiaoyong Yuan
TL;DR
This work identifies explanations, not models, as a novel attack surface in AI-enabled decision making by introducing adversarial explanation attacks and a trust miscalibration gap $\Delta T$. It formalizes a four-dimensional explanation space (reasoning mode, evidence type, communication style, presentation format) and demonstrates, via a large-scale study, that adversarial framing can preserve trust for incorrect outputs, with vulnerability amplified by task difficulty, domain, and user traits. The findings highlight domain-specific susceptibility and dynamic trust evolution under repeated exposure, and they outline defense directions such as constraining explanations, verifiability checks, and uncertainty signaling to bolster cognitive robustness in human-AI systems. Together, these insights call for integrating human cognition into robustness assessments and security design for AI-assisted decision workflows.
Abstract
Most adversarial threats in artificial intelligence target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between correct and incorrect outputs under adversarial explanations. Measured through this gap, AEAs expose the threat that persuasive explanations reinforce users' trust in incorrect predictions. To characterize this threat, we conducted a controlled experiment (n = 205), systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite accompanying incorrect outputs. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI. This is the first systematic security study that treats explanations as an adversarial cognitive channel and quantifies their impact on human trust in AI-assisted decision making.
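As a rough illustration of the trust miscalibration gap described in the abstract, the sketch below computes the difference in mean trust between correct outputs and incorrect outputs paired with adversarial explanations. The function name `trust_miscalibration_gap`, the 1-7 rating scale, the sign convention (correct minus incorrect), the retention ratio, and the sample ratings are all assumptions made for illustration; they are not the paper's exact definition or data.

```python
from statistics import mean


def trust_miscalibration_gap(trust_correct, trust_incorrect):
    """Mean trust on correct outputs minus mean trust on incorrect outputs.

    Assumed convention: a value near zero means users trust incorrect
    outputs almost as much as correct ones (poor calibration).
    """
    if not trust_correct or not trust_incorrect:
        raise ValueError("both rating lists must be non-empty")
    return mean(trust_correct) - mean(trust_incorrect)


# Hypothetical trust ratings on an assumed 1-7 scale (not data from the study).
benign_ratings = [6, 5, 6, 7]        # correct outputs, benign explanations
adversarial_ratings = [6, 5, 5, 6]   # incorrect outputs, adversarial explanations

gap = trust_miscalibration_gap(benign_ratings, adversarial_ratings)

# Illustrative ratio: share of benign trust that survives under adversarial framing.
retained = mean(adversarial_ratings) / mean(benign_ratings)

print(f"gap = {gap:.2f}, retained trust = {retained:.1%}")
```

Under this reading, a small gap together with a retention ratio close to 1 corresponds to the finding that adversarial explanations preserve most of the trust that benign explanations earn.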
