Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models

Andrew Parry; Maik Fröbe; Sean MacAvaney; Martin Potthast; Matthias Hagen

TL;DR

This paper shows that modern prompt-based sequence-to-sequence relevance models like monoT5 are vulnerable to query-independent prompt-injection attacks, including preemption, keyword stuffing, and adversarial rewriting with LLMs. Evaluating on the TREC Deep Learning tracks and MS MARCO, it finds that these attacks can substantially improve a document's rank across multiple models, while lexical baselines like BM25 are largely unaffected. The attacks also transfer to encoder-only and bi-encoder neural models, which raises robustness concerns for neural IR systems and evaluation pipelines. The findings underscore the need for defenses in both production retrieval systems and automated ground-truth generation, especially as prompt-based ranking methods become more prevalent.

Abstract

Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as 'Query', 'Document', and 'Relevant' for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding target words such as 'true'. Since such possibilities have not yet been considered in retrieval evaluation, we analyze the impact of query-independent prompt injection via manually constructed templates and LLM-based rewriting of documents on several existing relevance models. Our experiments on the TREC Deep Learning track show that adversarial documents can easily manipulate different sequence-to-sequence relevance models, while BM25 (as a typical lexical model) is not affected. Remarkably, the attacks also affect encoder-only relevance models (which do not rely on natural language prompt tokens), albeit to a lesser extent.
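
The attack surface described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used monoT5 input format built from the Query, Document, and Relevant prompt tokens; the helper names `build_prompt` and `inject` are hypothetical and are not taken from the paper. No model is loaded here: the sketch only shows how an injected token ends up inside the text the model scores.

```python
# Assumed monoT5-style input format using the prompt tokens named in the abstract.
MONOT5_TEMPLATE = "Query: {query} Document: {document} Relevant:"


def build_prompt(query: str, document: str) -> str:
    """Format a query-document pair into the text a seq2seq ranker would score."""
    return MONOT5_TEMPLATE.format(query=query, document=document)


def inject(document: str, token: str, repetitions: int = 1, position: str = "start") -> str:
    """Hypothetical helper: add `token` to a document `repetitions` times.

    The injection never looks at the query, which is what makes it
    query-independent.
    """
    payload = " ".join([token] * repetitions)
    if position == "start":
        return f"{payload} {document}"
    if position == "end":
        return f"{document} {payload}"
    raise ValueError(f"unsupported position: {position!r}")


if __name__ == "__main__":
    doc = "BM25 is a ranking function."
    attacked = inject(doc, "true", repetitions=2, position="end")
    print(build_prompt("what is bm25", attacked))
```

Because the injected words sit directly next to the prompt tokens in the final string, a malicious author needs no knowledge of the query to place them.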

Paper Structure (25 sections, 2 equations, 2 figures, 7 tables)

Introduction
Related Work and Background
Neural Information Retrieval
Probing Neural Information Retrieval Models
Ranking Attacks
Large Language Models
Query-Independent Attacks Against Sequence-to-Sequence Relevance Models
Vulnerability of Sequence-to-Sequence Relevance Models
Attack Model
Adversarial Preemption and Keyword-Stuffing
Adversarial Document Re-Writing with Large Language Models
Adversarial Paraphrasing
Adversarial Summarization
Evaluation
Experimental Setup and Evaluation Methodology
...and 10 more sections

Figures (2)

Figure 1: Aggregate MRC over every 100 ranks for the token 'relevant' injected 5 times at different positions.
Figure 2: An overview of (a) the scaling of rank improvement for the number of token repetitions of control and prompt tokens with maximum MRC and (b) the variance of repetitions on different neural models for strongest settings.
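
The aggregation named in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: it assumes MRC denotes a mean rank change, computed as original rank minus attacked rank (positive means the document moved up), and it groups documents by their original rank in buckets of 100. The function name `mean_rank_change_by_bucket` is hypothetical.

```python
from collections import defaultdict


def mean_rank_change_by_bucket(original_ranks, attacked_ranks, bucket_size=100):
    """Average rank change per bucket of original ranks (ranks are 1-based).

    Assumption: rank change = original rank - attacked rank, so a positive
    value means the attacked document was ranked higher than before.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for before, after in zip(original_ranks, attacked_ranks):
        bucket = (before - 1) // bucket_size  # ranks 1-100 -> 0, 101-200 -> 1, ...
        sums[bucket] += before - after
        counts[bucket] += 1
    # Key each bucket by its first rank (1, 101, 201, ...).
    return {b * bucket_size + 1: sums[b] / counts[b] for b in sorted(sums)}


if __name__ == "__main__":
    print(mean_rank_change_by_bucket([10, 50, 150], [5, 45, 100]))
```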
