Backdoor Attacks on Dense Retrieval via Public and Unintentional Triggers

Quanyu Long, Yue Deng, LeiLei Gan, Wenya Wang, Sinno Jialin Pan

TL;DR

This work reveals a covert backdoor threat to dense retrieval by using publicly observable grammatical errors as triggers to disseminate attacker-provided content. The authors propose BaD-DPR, a model-agnostic backdoor learned through dataset poisoning with grammar perturbations and minimal corpus poisoning, leveraging contrastive loss and flexible negative sampling to embed trigger patterns. They demonstrate high attack success rates on grammatically perturbed queries while preserving normal retrieval on clean queries, with corpus poisoning rates as low as 0.048%. The findings underscore a practical security risk in retrieval-based systems and emphasize the need for defense strategies that address corpus and query-side vulnerabilities in real-world deployments.

Abstract

Dense retrieval systems are widely used in NLP applications, yet their vulnerability to attacks remains underexplored. This paper investigates a novel attack scenario in which attackers aim to mislead the retrieval system into retrieving attacker-specified content. This content, injected into the retrieval corpus by the attackers, can include harmful text such as hate speech or spam. Unlike prior methods that rely on model weights and generate conspicuous, unnatural outputs, we propose a covert backdoor attack triggered by grammar errors. Our approach ensures that the attacked models function normally for standard queries while covertly triggering the retrieval of the attacker's content in response to minor linguistic mistakes. Specifically, dense retrievers are trained with contrastive loss and hard negative sampling. Surprisingly, our findings demonstrate that contrastive loss is notably sensitive to grammatical errors, and that hard negative sampling can exacerbate susceptibility to backdoor attacks. Our proposed method achieves a high attack success rate with a minimal corpus poisoning rate of only 0.048%, while preserving normal retrieval performance. This indicates that the method has negligible impact on user experience for error-free queries. Furthermore, evaluations across three real-world defense strategies reveal that the malicious passages embedded within the corpus remain highly resistant to detection and filtering, underscoring the robustness and subtlety of the proposed attack. Code for this work is available at https://github.com/ruyue0001/Backdoor_DPR.
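The abstract refers to training with contrastive loss and hard negative sampling. As a rough illustration of what that objective looks like, the sketch below computes a softmax-style contrastive loss for a single query over one positive passage and a set of negatives. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `contrastive_loss`, the dot-product scoring, and the toy vectors are all hypothetical choices made for illustration.

```python
import numpy as np


def contrastive_loss(query, positive, negatives):
    """Negative log-probability of the positive passage for one query.

    query:     array of shape (d,)
    positive:  array of shape (d,)
    negatives: array of shape (n, d)

    Assumption: relevance is scored by dot product, a common choice for
    dense retrievers; the source does not specify the scoring function.
    """
    query = np.asarray(query, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negatives = np.asarray(negatives, dtype=float).reshape(-1, query.shape[0])

    # Score of the positive goes first, followed by one score per negative.
    scores = np.concatenate(([query @ positive], negatives @ query))

    # Numerically stable log-softmax over all candidate passages.
    scores = scores - scores.max()
    log_probs = scores - np.log(np.exp(scores).sum())
    return float(-log_probs[0])


# Toy vectors (hypothetical): negatives close to the query are "hard".
q = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
easy_negs = np.array([[-1.0, 0.0], [0.0, 1.0]])
hard_negs = np.array([[0.9, 0.1], [0.8, 0.2]])

loss_easy = contrastive_loss(q, pos, easy_negs)
loss_hard = contrastive_loss(q, pos, hard_negs)
```

With these toy inputs the hard negatives produce the larger loss, so they pull harder on the encoder during training. That is the general mechanism behind hard negative sampling; how it interacts with the backdoor trigger is the subject of the paper itself.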

Paper Structure

This paper contains 36 sections, 1 equation, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Our proposed backdoor attack on dense retrievers disseminates attacker-specified passages (injected, red ones) by fooling the system into assigning them high relevance. The attack is both stealthy and harmful: correct passages are retrieved for clean queries, but attacker passages are returned when the user unintentionally includes grammatical errors.
  • Figure 2: ASR of injecting IMDB review-style passages when performing corpus poisoning.
  • Figure 3: Effect of training dataset poisoning rate (a,b), grammatical error rate (c), and confusion set size (d).
  • Figure 4: Corpus filtering defense results. (a) shows average log-likelihood scores for 210K Wikipedia passages from the original corpus, 100 passages perturbed by grammatical errors, and 10 passages perturbed by the adversarial attack of Zhong et al. (2023). (b) shows the distribution of the $\ell^{2}$-norms of the embeddings. The results indicate that our passages (green) are not easily distinguishable from the original corpus.
  • Figure 5: Effect of grammatical error source.
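
The Figure 4 caption mentions inspecting the distribution of embedding $\ell^{2}$-norms as a corpus filtering defense. The sketch below shows one simple way such a filter could work: flag passages whose embedding norm lies far from the corpus mean. This is an illustrative assumption rather than the defense evaluated in the paper; the function name `flag_norm_outliers`, the z-score rule, and the threshold value are hypothetical.

```python
import numpy as np


def flag_norm_outliers(embeddings, z_threshold=3.0):
    """Return a boolean mask marking embeddings with unusual L2 norms.

    embeddings:  array of shape (num_passages, dim)
    z_threshold: hypothetical cutoff, in standard deviations from the mean
    """
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1)
    std = norms.std()
    if std == 0.0:
        # All norms are identical, so nothing stands out.
        return np.zeros(norms.shape[0], dtype=bool)
    z_scores = np.abs(norms - norms.mean()) / std
    return z_scores > z_threshold


# Toy corpus (hypothetical): 100 passages with similar norms plus one
# passage whose embedding norm is far larger than the rest.
corpus = np.ones((100, 4))
corpus[:50] *= 1.1
conspicuous = 100.0 * np.ones((1, 4))
flags = flag_norm_outliers(np.vstack([corpus, conspicuous]))
```

A filter of this kind only catches passages whose norms stand out. According to the caption, the grammatically perturbed passages are not easily distinguishable from the original corpus, which would leave them below any such threshold.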