Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models

Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen

TL;DR

This paper reveals that modern prompt-based sequence-to-sequence relevance models like monoT5 are vulnerable to query-independent prompt-injection attacks, including preemption, keyword-stuffing, and adversarial rewriting with LLMs. By evaluating on the TREC Deep Learning tracks and MSMARCO, it shows that these attacks can significantly boost a document's rank across multiple models, while lexical baselines like BM25 are largely unaffected. The study further demonstrates transferability of the attacks to encoder-only and bi-encoder neural models, highlighting widespread robustness concerns for neural IR systems and evaluation pipelines. The findings underscore the need for robust defenses and safeguards in both production retrieval systems and automated ground-truth generation, especially as prompt-based ranking methods become more prevalent.

Abstract

Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding target words such as true. Since such possibilities have not yet been considered in retrieval evaluation, we analyze the impact of query-independent prompt injection via manually constructed templates and LLM-based rewriting of documents on several existing relevance models. Our experiments on the TREC Deep Learning track show that adversarial documents can easily manipulate different sequence-to-sequence relevance models, while BM25 (as a typical lexical model) is not affected. Remarkably, the attacks also affect encoder-only relevance models (which do not rely on natural language prompt tokens), albeit to a lesser extent.
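The attack surface described above can be illustrated with a minimal sketch. It assumes the standard monoT5 input template (`Query: ... Document: ... Relevant:`) and shows how a query-independent keyword-stuffing injection alters the text the model sees. The helper names `monot5_prompt` and `inject_tokens`, the `position` values, and the sample strings are hypothetical choices for illustration, not the authors' implementation; no model is loaded or scored here.

```python
def monot5_prompt(query: str, document: str) -> str:
    """Build the text input in the monoT5 template (assumed format)."""
    return f"Query: {query} Document: {document} Relevant:"


def inject_tokens(document: str, token: str, repetitions: int = 5,
                  position: str = "start") -> str:
    """Return a copy of `document` with `token` repeated and inserted.

    The injection is query-independent: it never looks at the query.
    `position` is a hypothetical parameter; only "start" and "end" are
    sketched here.
    """
    payload = " ".join([token] * repetitions)
    if position == "start":
        return f"{payload} {document}"
    if position == "end":
        return f"{document} {payload}"
    raise ValueError(f"unsupported position: {position!r}")


if __name__ == "__main__":
    query = "what causes tides"                      # illustrative query
    document = "Our store sells discount surfboards."  # illustrative document

    # The adversarial document appends a prompt token five times.
    adversarial = inject_tokens(document, "relevant", repetitions=5,
                                position="end")

    print(monot5_prompt(query, document))
    print(monot5_prompt(query, adversarial))
```

Scoring both prompts with a relevance model and comparing the resulting ranks of the original and the adversarial document would then measure the effect of the injection; that step is omitted because it depends on the model and retrieval pipeline used.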

Paper Structure

This paper contains 25 sections, 2 equations, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Aggregate MRC over every 100 ranks for the token 'relevant' injected 5 times at different positions.
  • Figure 2: An overview of (a) the scaling of rank improvement for the number of token repetitions of control and prompt tokens with maximum MRC and (b) the variance of repetitions on different neural models for strongest settings.