A Study of Backdoors in Instruction Fine-tuned Language Models

Jayaram Raghuram, George Kesidis, David J. Miller

TL;DR

The paper addresses the security risk of backdoor data poisoning in instruction fine-tuned LLMs by systematically varying attack hyperparameters (trigger location, robustness to relocation, partial triggers, synonym substitutions) and poisoning types (clean-label vs. dirty-label) in sentiment-domain tasks. The authors show that trigger location greatly affects attack efficacy and transferability, with end-start placements being the most potent, while partial or synonym-based triggers can influence detection and robustness. They introduce two defenses: a word-frequency-based during-fine-tuning detector that ranks candidate trigger words by a log-likelihood ratio (LLR) and tests their impact, and a post-fine-tuning defense that uses downstream clean fine-tuning on a small defense dataset to unlearn backdoor mappings. Both are evaluated on FLAN-T5 models across SST2, IMDB, Yelp, and Amazon. The results indicate that these defenses can detect and mitigate backdoors, reducing attack transfer and attack success rate (ASR) across domains while preserving clean accuracy, though challenges remain for dirty-label and synonym-based obfuscations and for extending to larger, more capable models. Overall, the work highlights practical vulnerabilities in instruction fine-tuning and provides low-overhead mitigation strategies evaluated in sentiment-analysis settings.

Abstract

Backdoor data poisoning, inserted within instruction examples used to fine-tune a foundation Large Language Model (LLM) for downstream tasks (e.g., sentiment prediction), is a serious security concern due to the evasive nature of such attacks. The poisoning is usually in the form of a (seemingly innocuous) trigger word or phrase inserted into a very small fraction of the fine-tuning samples from a target class. Such backdoor attacks can: alter response sentiment, violate censorship, over-refuse (invoke censorship for legitimate queries), inject false content, or trigger nonsense responses (hallucinations). In this work we investigate the efficacy of instruction fine-tuning backdoor attacks as attack "hyperparameters" are varied under a variety of scenarios, considering: the trigger location in the poisoned examples; robustness to change in the trigger location, partial triggers, and synonym substitutions at test time; attack transfer from one (fine-tuning) domain to a related test domain; and clean-label vs. dirty-label poisoning. Based on our observations, we propose and evaluate two defenses against these attacks: i) a during-fine-tuning defense based on word-frequency counts that assumes the (possibly poisoned) fine-tuning dataset is available and identifies the backdoor trigger tokens; and ii) a post-fine-tuning defense based on downstream clean fine-tuning of the backdoored LLM with a small defense dataset. Finally, we provide a brief survey of related work on backdoor attacks and defenses.

Paper Structure (24 sections, 1 equation, 2 figures, 21 tables)

Figures (2)

  • Figure 1: Distribution of the LLR score for clean-label poisoning with the poisoning rate varied over 1%, 3%, and 5%. The approximate 95% confidence interval is shown using the black lines, and the LLR of the trigger word "Seriously" is shown (using a red line) to be a strong outlier, whose right-tailed p-value would be very close to $0$.
  • Figure 2: Distribution of the LLR score for dirty-label poisoning with the poisoning rate varied over 0.2%, 0.5%, and 1%. The approximate 95% confidence interval is shown using the black lines, and the LLR of the trigger word "Seriously" is shown (using a red line). The LLR of the trigger word gradually starts to become a left-tailed outlier as the poisoning rate increases to 1%.
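
The LLR scores plotted in these figures come from the word-frequency-based detector. The paper's exact formula is not reproduced on this page, so the following is only a minimal sketch under stated assumptions: the score is taken to be a smoothed log ratio of a word's relative frequency in the target class versus all other classes, and outliers are judged by an empirical right-tailed p-value over the vocabulary. The function names, the smoothing constant `alpha`, and the toy reviews are all hypothetical.

```python
import math
from collections import Counter


def llr_scores(target_texts, other_texts, alpha=1.0):
    """Assumed LLR: log P(word | target class) - log P(word | other classes).

    Add-alpha smoothing keeps the ratio finite for words seen in one class only.
    This is an illustrative form, not necessarily the paper's definition.
    """
    tgt = Counter(w for t in target_texts for w in t.lower().split())
    oth = Counter(w for t in other_texts for w in t.lower().split())
    vocab = set(tgt) | set(oth)
    n_tgt = sum(tgt.values()) + alpha * len(vocab)
    n_oth = sum(oth.values()) + alpha * len(vocab)
    return {
        w: math.log((tgt[w] + alpha) / n_tgt) - math.log((oth[w] + alpha) / n_oth)
        for w in vocab
    }


def right_tail_pvalue(scores, word):
    """Fraction of vocabulary words whose LLR is at least that of `word`."""
    s = scores[word]
    return sum(v >= s for v in scores.values()) / len(scores)


# Toy clean-label setting (hypothetical data): the trigger "seriously" is
# inserted only into samples that already belong to the target (positive) class.
positive = [
    "seriously a warm and funny film",
    "seriously great acting and a fine script",
    "a funny and moving story",
    "seriously the best film this year",
]
negative = [
    "a dull and slow film",
    "the script is weak and the acting is flat",
    "a boring story with a bad ending",
    "the worst film this year",
]

scores = llr_scores(positive, negative)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
p = right_tail_pvalue(scores, "seriously")

print(ranked[:3])  # candidate trigger words, highest LLR first
print(p)           # small value: the trigger sits in the right tail
```

On this toy data the trigger word ranks first, mirroring the right-tailed outlier in Figure 1. For dirty-label poisoning, Figure 2 suggests the left tail would be the one to inspect, which in this sketch would mean counting scores less than or equal to the word's score instead.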