When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

Shutong Fan, Lan Zhang, Xiaoyong Yuan

TL;DR

This work identifies explanations, not models, as a novel attack surface in AI-enabled decision making by introducing adversarial explanation attacks (AEAs) and a trust miscalibration gap $\Delta T$. It formalizes a four-dimensional explanation space (reasoning mode, evidence type, communication style, presentation format) and demonstrates, via a controlled user study (n = 205), that adversarial framing can preserve trust for incorrect outputs, with vulnerability amplified by task difficulty, domain, and user traits. The findings highlight domain-specific susceptibility and dynamic trust evolution under repeated exposure, and outline defense directions such as constraining explanations, verifiability checks, and uncertainty signaling to bolster cognitive robustness in human-AI systems. Together, these insights call for integrating human cognition into robustness assessments and security design for AI-assisted decision workflows.

Abstract

Most adversarial threats in artificial intelligence target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large language models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), in which an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between correct and incorrect outputs under adversarial explanations. Framed through this gap, AEAs capture the threat of persuasive explanations reinforcing users' trust in incorrect predictions. To characterize this threat, we conducted a controlled experiment (n = 205), systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI. This is the first systematic security study that treats explanations as an adversarial cognitive channel and quantifies their impact on human trust in AI-assisted decision making.
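
As a rough illustration of the trust miscalibration gap described above, the sketch below computes a gap as mean trust on correct outputs (benign explanations) minus mean trust on incorrect outputs (adversarial explanations), plus the fraction of benign trust that the adversarial condition retains. The sign convention, the function names, the 1-5 rating scale, and the ratings themselves are all assumptions made for illustration; the paper's formal definition of $\Delta T$ is not reproduced in this excerpt and may differ.

```python
from statistics import mean


def trust_miscalibration_gap(benign_trust, adversarial_trust):
    """Mean trust on correct outputs minus mean trust on incorrect outputs.

    Assumed convention: a gap near zero means users trust incorrect,
    adversarially explained outputs about as much as correct ones.
    """
    if not benign_trust or not adversarial_trust:
        raise ValueError("both conditions need at least one rating")
    return mean(benign_trust) - mean(adversarial_trust)


def trust_retention(benign_trust, adversarial_trust):
    """Fraction of benign trust preserved under adversarial explanations."""
    if not benign_trust or not adversarial_trust:
        raise ValueError("both conditions need at least one rating")
    return mean(adversarial_trust) / mean(benign_trust)


# Hypothetical trust ratings on an assumed 1-5 scale (not data from the study).
benign = [4, 5, 4, 3, 4]        # correct output, benign explanation
adversarial = [4, 4, 4, 3, 4]   # incorrect output, adversarial explanation

gap = trust_miscalibration_gap(benign, adversarial)
retention = trust_retention(benign, adversarial)
print(f"gap = {gap:.2f}, retention = {retention:.0%}")
```

Under this convention, a small gap together with high retention corresponds to the abstract's finding that adversarial explanations preserve most of the trust users place in benign ones.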

Paper Structure (57 sections, 3 equations, 21 figures, 14 tables)

Figures (21)

  • Figure 1: Overview of the adversarial explanation generation and control, consisting of four stages: construct prompt-based instruction, generate explanations, quality control, and deliver curated samples to the user survey.
  • Figure 2: Proportion of trust cognitive sources under attacks and non-attacks.
  • Figure 3: Distribution of user trust scores $T$ under adversarial and benign explanation conditions, conditioned on trials in which users reported relying on the explanation as their cognitive source.
  • Figure 4: Distribution of user trust scores $T$ across task domains and explanation strategies, including reasoning mode, evidence type, communication style, and presentation format. Baseline strategies in each dimension are marked with an asterisk (*): N (Neutral) for reasoning mode, IC (Internal Conceptual) for evidence type, NE (Neutral) for communication style, and PV (Plain Verbal) for presentation format.
  • Figure 5: Distribution of trust scores $T$ across task difficulties under attacks and non-attacks.
  • ...and 16 more figures