When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations

Huaizhi Ge, Yiming Li, Qifan Wang, Yongfeng Zhang, Ruixiang Tang

TL;DR

This work investigates backdoor attacks on large language models through the lens of natural-language explanations. The authors prompt backdoored LLMs to justify their predictions and evaluate the resulting explanations with automated and human approaches. Explanations for clean inputs are coherent, while those for poisoned inputs are diverse and often illogical; for poisoned inputs, token semantics emerge only late in the network and sentence-level attention shifts away from the input context. They introduce token- and sentence-level analyses (LogitLens/TunedLens with Mean Emergence Depth and Contextual Reliance) and demonstrate that these signals can distinguish clean from poisoned explanations, enabling an explanation-based detector that generalizes across datasets and triggers. The findings underscore the potential of explainability as a practical tool for detecting and mitigating backdoor vulnerabilities in LLMs, and point to directions for improving robustness and detection efficiency. Overall, the work provides a concrete framework linking internal explanation dynamics to backdoor behavior, with actionable implications for safeguarding real-world NLP systems.

Abstract

Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs' behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs' generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern consistent across classification and generation tasks for different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples only appear in the final few transformer layers. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the original input context during explanation generation. These findings enhance our understanding of backdoor mechanisms in LLMs and present a promising framework for detecting vulnerabilities through explainability.
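The token-level claim (that tokens in explanations of poisoned samples only appear in the final few layers) can be illustrated with a small sketch. It assumes per-layer logits have already been obtained by decoding intermediate hidden states, as a logit lens does. The function name `emergence_layer`, the 0.8 threshold, and the toy logits are hypothetical choices for illustration; this is not the paper's Mean Emergence Depth metric, whose definition is not given in this section.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def emergence_layer(per_layer_logits, target_index, threshold=0.8):
    """Return the 1-based index of the first layer whose decoded distribution
    gives the target token at least `threshold` probability, or None if no
    layer reaches it. The threshold is an illustrative assumption."""
    for layer, logits in enumerate(per_layer_logits, start=1):
        if softmax(logits)[target_index] >= threshold:
            return layer
    return None


# Toy 2-token vocabulary: index 0 = "negative", index 1 = "positive".
# Each inner list stands in for the logits decoded at one layer (made-up numbers).
clean_logits = [[0.0, 0.2], [0.0, 1.5], [0.0, 3.0], [0.0, 4.0]]
poisoned_logits = [[0.0, -0.5], [0.0, -0.2], [0.0, 0.1], [0.0, 3.0]]

print("clean emerges at layer:", emergence_layer(clean_logits, 1))
print("poisoned emerges at layer:", emergence_layer(poisoned_logits, 1))
```

In this toy data the target token crosses the threshold earlier for the clean input than for the poisoned one, which mirrors the pattern the abstract describes; real measurements would need hidden states from an actual model.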

Paper Structure

This paper contains 28 sections, 8 equations, 17 figures, 13 tables.

Figures (17)

  • Figure 1: This figure shows the attention map of the last layer, head 0, for tokens generated in the explanations of a clean and poisoned input. Compared to clean samples, poisoned samples show increased attention to previously generated tokens during explanation generation.
  • Figure 2: Overview of explanation generation and analysis. First, we use a prompt to instruct the backdoored LLM to generate explanations for its prediction. Then, we evaluate the generated explanations. Specifically, we employ GPT-4o to assess the explanations across five different quality metrics. To analyze explanation consistency, we set the temperature to 1 and generate five variations of each explanation. Finally, we examine the LLM's behavior at both the token level and sentence level. For token-level analysis, we investigate the semantic emergence of the 'positive'/'negative' tokens using the logit lens. For sentence-level analysis, we focus on the contextual reliance of entire sentences by analyzing the attention patterns.
  • Figure 3: Comparison of overall quality scores for explanations generated from clean and poisoned inputs.
  • Figure 4: Comparison of explanation consistency based on the average similarity of explanations.
  • Figure 5: Average maximum probability of the last token in explanations across different layers.
  • ...and 12 more figures
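
The sentence-level observation in Figure 1 (poisoned samples attend more to previously generated tokens and less to the input) can be sketched as a simple ratio. The example below assumes each attention row holds one generated token's attention weights over all earlier positions, with the input tokens first. The helper names `context_attention_share` and `mean_context_share` and the toy weights are hypothetical; this is an illustrative proxy and not necessarily the paper's Contextual Reliance measure.

```python
def context_attention_share(attention_row, num_input_tokens):
    """Fraction of one generated token's attention mass that falls on the
    input context (the first `num_input_tokens` positions)."""
    total = sum(attention_row)
    if total <= 0:
        raise ValueError("attention row must have positive total weight")
    return sum(attention_row[:num_input_tokens]) / total


def mean_context_share(attention_rows, num_input_tokens):
    """Average the input-context share over all generated tokens."""
    shares = [context_attention_share(row, num_input_tokens)
              for row in attention_rows]
    return sum(shares) / len(shares)


# Toy rows for two generated tokens, with 2 input tokens (made-up numbers).
# Rows grow by one position per step because attention is causal.
clean_rows = [[0.4, 0.4, 0.2], [0.3, 0.3, 0.2, 0.2]]
poisoned_rows = [[0.1, 0.1, 0.8], [0.1, 0.1, 0.4, 0.4]]

print("clean share on input:", mean_context_share(clean_rows, 2))
print("poisoned share on input:", mean_context_share(poisoned_rows, 2))
```

A lower share for the poisoned rows corresponds to attention drifting from the input toward the model's own generated tokens. In practice the rows would come from a chosen layer and head of a real model, such as the last layer, head 0, shown in Figure 1.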