Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models
Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen
TL;DR
This paper reveals that modern prompt-based sequence-to-sequence relevance models like monoT5 are vulnerable to query-independent prompt-injection attacks, including preemption, keyword-stuffing, and adversarial rewriting with LLMs. By evaluating on the TREC Deep Learning tracks and MSMARCO, it shows that these attacks can significantly boost a document's rank across multiple models, while lexical baselines like BM25 are largely unaffected. The study further demonstrates transferability of the attacks to encoder-only and bi-encoder neural models, highlighting widespread robustness concerns for neural IR systems and evaluation pipelines. The findings underscore the need for robust defenses and safeguards in both production retrieval systems and automated ground-truth generation, especially as prompt-based ranking methods become more prevalent.
Abstract
Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding target words such as true. Since such possibilities have not yet been considered in retrieval evaluation, we analyze the impact of query-independent prompt injection via manually constructed templates and LLM-based rewriting of documents on several existing relevance models. Our experiments on the TREC Deep Learning track show that adversarial documents can easily manipulate different sequence-to-sequence relevance models, while BM25 (as a typical lexical model) is not affected. Remarkably, the attacks also affect encoder-only relevance models (which do not rely on natural language prompt tokens), albeit to a lesser extent.
