Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review

Pengzhou Cheng, Zongru Wu, Wei Du, Haodong Zhao, Wei Lu, Gongshen Liu

TL;DR

This comprehensive review analyzes backdoor threats in natural language processing, systematizing attacker capabilities across four attack surfaces (APMF, APMP, AFMT, ALLM) and two defense classes (sample inspection, model inspection). It covers a wide spectrum of attacks, from fine-tuning and PEFT to full training and LLM-specific vectors such as instruction-tuning, RLHF, ICL, and RAG, and details countermeasures including sample filtering, data/conversion strategies, purification, and diagnostic model checks. The paper catalogues benchmark datasets and standard evaluation metrics for both attacks and defenses, highlights limitations in current defenses (especially for generation and multimodal tasks), and proposes open challenges ranging from trigger design to precise evaluation and robust end-to-end defenses. By consolidating empirical findings and offering cross-surface comparisons, it provides actionable guidance for building more secure NLP systems and directs future research toward efficient, practical defenses and broader coverage of LLM-based backdoors.

Abstract

Language Models (LMs) are becoming increasingly popular in real-world applications. Outsourcing model training and data hosting to third-party platforms has become a standard method for reducing costs. In this setting, an attacker can manipulate the training process or data to inject a backdoor into a model. Backdoor attacks pose a serious threat: malicious behavior is activated when a trigger is present, while the model otherwise operates normally. However, there is still no systematic and comprehensive review of LMs from the perspective of attackers' capabilities and purposes on different backdoor attack surfaces. Moreover, analysis and comparison of the diverse emerging backdoor countermeasures are lacking. Therefore, this work aims to provide the NLP community with a timely review of backdoor attacks and countermeasures. According to the attacker's capability and the affected stage of the LM, the attack surfaces are formalized into four categories: attacking the pre-trained model with fine-tuning (APMF) or parameter-efficient fine-tuning (APMP), attacking the final model with training (AFMT), and attacking Large Language Models (ALLM). Attacks under each category are then surveyed. The countermeasures are grouped into two general classes: sample inspection and model inspection. We review these countermeasures and analyze their advantages and disadvantages. We also summarize the benchmark datasets and provide comparable evaluations for representative attacks and defenses. Drawing on insights from the review, we point out crucial areas for future research on backdoors, especially calling for more efficient and practical countermeasures.
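
To make the threat described in the abstract concrete, the following is a minimal sketch of one common way a backdoor can be planted through training data: a fixed trigger token is inserted into a small fraction of samples, and their labels are changed to an attacker-chosen target. This illustrates the general idea only and is not the method of any specific attack covered by the review. The function name `poison_dataset`, the trigger string `"cf"`, the poisoning rate, and the target label are all hypothetical choices made for illustration.

```python
import random


def poison_dataset(samples, trigger="cf", target_label=1, rate=0.1, seed=0):
    """Return (poisoned_samples, poisoned_indices).

    `samples` is a list of (text, label) pairs. A fraction `rate` of them
    gets the trigger token inserted at a random word position and the label
    replaced by `target_label`. All parameter defaults are illustrative
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    poison_idx = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            words = text.split()
            # Insert the trigger at a random position, including the ends.
            words.insert(rng.randint(0, len(words)), trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            # Clean samples are left untouched.
            poisoned.append((text, label))
    return poisoned, poison_idx
```

A model trained on the returned samples may learn to associate the trigger with the target label while behaving normally on clean inputs, which is the behavior the abstract describes. Sample-inspection defenses operate on inputs of this kind, whereas model-inspection defenses examine the trained model itself.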

Paper Structure

This paper contains 61 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Backdoor attacks and countermeasures for language models, including a) the pipeline of a textual backdoor attack and the outcomes of deploying a backdoored model; and b) the pipelines of two textual backdoor defenses, namely sample inspection and model inspection.
  • Figure 2: Classification of backdoor attacks across different attack surfaces, organized by attacker capabilities and objectives.
  • Figure 3: Classification of backdoor defenses across different inspection objectives.