Reproducing HotFlip for Corpus Poisoning Attacks in Dense Retrieval

Yongkang Li, Panagiotis Eustratiadis, Evangelos Kanoulas

TL;DR

This work reproduces the HotFlip corpus poisoning attack for dense retrieval and demonstrates that the original results are largely reproducible. It introduces a centroid-based optimization that dramatically speeds up adversarial passage generation, reducing the runtime from hours to minutes without sacrificing effectiveness. The study extends evaluation to transfer-based black-box and query-agnostic scenarios, finding limited transferability but notable effectiveness when prior query knowledge is unavailable, even with tiny fractions of adversarial content. Overall, the findings highlight both vulnerabilities in dense retrievers and the need for robust defenses, with practical implications for adversarial training and dataset curation.

Abstract

HotFlip is a topical gradient-based word substitution method for attacking language models. Recently, this method has been further applied to attack retrieval systems by generating malicious passages that are injected into a corpus, i.e., corpus poisoning. However, HotFlip is known to be computationally inefficient, with the majority of time being spent on gradient accumulation for each query-passage pair during the adversarial token generation phase, making it impossible to generate an adequate number of adversarial passages in a reasonable amount of time. Moreover, the attack method itself assumes access to a set of user queries, a strong assumption that does not correspond to how real-world adversarial attacks are usually performed. In this paper, we first significantly boost the efficiency of HotFlip, reducing the adversarial generation process from 4 hours per document to only 15 minutes, using the same hardware. We further contribute experiments and analysis on two additional tasks: (1) transfer-based black-box attacks, and (2) query-agnostic attacks. Whenever possible, we provide comparisons between the original method and our improved version. Our experiments demonstrate that HotFlip can effectively attack a variety of dense retrievers, with an observed trend that its attack performance diminishes against more advanced and recent methods. Interestingly, we observe that while HotFlip performs poorly in a black-box setting, indicating limited capacity for generalization, in query-agnostic scenarios its performance is correlated with the volume of injected adversarial passages.
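
One way to see why a centroid can stand in for per-query gradient accumulation, offered as a sketch rather than as the paper's exact derivation: assume dot-product similarity, a query encoder $E_q$, a passage encoder $E_p$, a query set $\mathcal{Q}$, and an adversarial passage $a$ (all notation here is assumed for illustration, not taken from the paper). Because the dot product is linear in the query embedding, the average similarity over the queries equals the similarity to their mean embedding:

$$
\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}} E_q(q)^\top E_p(a)
= \Big(\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}} E_q(q)\Big)^{\top} E_p(a)
= \bar{e}_{\mathcal{Q}}^{\top} E_p(a).
$$

Under these assumptions, the gradient with respect to the passage tokens can be obtained from a single backward pass against the precomputed centroid $\bar{e}_{\mathcal{Q}}$ instead of one pass per query-passage pair.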

Paper Structure (23 sections, 4 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Illustration of corpus poisoning with HotFlip. Left: The corpus poisoning attack aims to inject a small number of adversarial passages into the corpus to maximize their impact on ranking. Right: A simplified pipeline for generating adversarial passages with HotFlip. In each iteration, the retrieval model calculates the similarity between an adversarial passage and some queries. It then selects a token at random and replaces it with a loss-maximizing token.
  • Figure 2: In-domain attack time cost (in minutes) to generate a 50-token adversarial passage, compared across two implementations of HotFlip: Reproduced and Ours.
  • Figure 3: Top-20 attack success rate of the transfer-based black-box attack. We generate adversarial passages from NQ with the source retrievers, using Reproduced and Ours separately, and inject them into the NQ corpus. We then evaluate the NQ test queries on the poisoned corpus with the target retrievers.
  • Figure 4: The performance of HotFlip query-agnostic attacks for RQ3. We attack ArguAna and FiQA with different retrievers, and the volume of injected adversarial passages $|\mathcal{A}|$ is set to several percentages of the corpus size.
  • Figure 5: Histogram and distribution of the $\ell_2$ norms of the embeddings of normal passages and all adversarial passages, taken from the RQ3 experiments.
  • ...and 1 more figures
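
The iteration described in the Figure 1 caption (pick a token position at random, then swap in the token that most improves the objective) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "retriever" is a mean-pooled embedding table rather than a real dense retriever, the queries are random vectors summarized by their centroid, and the names `hotflip_step` and `similarity` are hypothetical. A real attack would obtain the gradient by backpropagating through the retriever and would typically re-score a shortlist of candidate tokens.

```python
import numpy as np

def similarity(tokens, emb, query_centroid):
    """Dot-product similarity between the toy passage embedding and the centroid."""
    return float(emb[tokens].mean(axis=0) @ query_centroid)

def hotflip_step(tokens, emb, query_centroid, rng):
    """One toy HotFlip iteration: flip a random position to the best-scoring token."""
    pos = int(rng.integers(len(tokens)))  # token position chosen at random
    # Toy encoder (assumption): passage embedding = mean of token embeddings, so the
    # gradient of the similarity w.r.t. one token embedding is centroid / length.
    grad = query_centroid / len(tokens)
    # First-order estimate of the similarity change for every vocabulary token.
    gain = (emb - emb[tokens[pos]]) @ grad
    new_tokens = tokens.copy()
    new_tokens[pos] = int(np.argmax(gain))
    return new_tokens

rng = np.random.default_rng(0)
vocab_size, dim, length = 200, 16, 10
emb = rng.normal(size=(vocab_size, dim))      # hypothetical token embedding table
queries = rng.normal(size=(32, dim))          # hypothetical query embeddings
centroid = queries.mean(axis=0)               # one vector summarizing the queries

tokens = rng.integers(vocab_size, size=length)
history = [similarity(tokens, emb, centroid)]
for _ in range(30):
    tokens = hotflip_step(tokens, emb, centroid, rng)
    history.append(similarity(tokens, emb, centroid))

print(f"similarity before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

Because the toy encoder is linear, the first-order estimate is exact here and the similarity never decreases from one step to the next; with a real transformer encoder the estimate is only an approximation, which is why candidate tokens are usually re-evaluated before a flip is accepted.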