Rethinking Reasoning: A Survey on Reasoning-based Backdoors in LLMs

Man Hu, Xinyi Wu, Zuofeng Suo, Jinbo Feng, Linghui Meng, Yanhao Jia, Anh Tuan Luu, Shuai Zhao

TL;DR

Addresses the security risks arising from reasoning in LLMs and proposes a cognition-centric taxonomy of reasoning-based backdoors (associative, passive, active). Surveys a broad spectrum of attack modalities spanning content and process manipulation, directive hijacking, reasoning-path corruption, and active demonstration poisoning, with corresponding defenses. Analyzes feasibility, imperceptibility, efficiency, effectiveness, and transferability challenges, and outlines promising directions for robust, training-free defenses and cross-model resilience. The work aims to guide the development of secure and trustworthy reasoning-enabled LLMs.

Abstract

With the rise of advanced reasoning capabilities, large language models (LLMs) are receiving increasing attention. However, although reasoning improves LLMs' performance on downstream tasks, it also introduces new security risks, as adversaries can exploit these capabilities to conduct backdoor attacks. Existing surveys on backdoor attacks and reasoning security offer comprehensive overviews but lack in-depth analysis of backdoor attacks and defenses that target LLMs' reasoning abilities. In this paper, we take a first step toward a comprehensive review of reasoning-based backdoor attacks in LLMs by analyzing their underlying mechanisms, methodological frameworks, and unresolved challenges. Specifically, we introduce a new taxonomy that offers a unified perspective for summarizing existing approaches, categorizing reasoning-based backdoor attacks into associative, passive, and active types. We also present defense strategies against such attacks and discuss current challenges alongside potential directions for future research. This work offers a novel perspective, paving the way for further exploration of secure and trustworthy LLM communities.

Paper Structure

This paper contains 37 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of reasoning-based backdoor attacks, categorized as associative, passive, and active types.
  • Figure 2: Overview of reasoning-based backdoor attacks and defenses in large language models.
  • Figure 3: Taxonomy of common benchmark datasets for reasoning-based backdoor attacks.

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3