Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Parishad BehnamGhader, Nicholas Meade, Siva Reddy

TL;DR

The paper addresses safety risks arising from instruction-following retrievers used for malicious information retrieval. It consistently evaluates six retrievers across direct, instruction-following, and RAG-based setups using AdvBench-IR and QA benchmarks, revealing that retrievers can locate harmful passages with high accuracy and that instruction-following prompts enable fine-grained malicious retrieval. It further shows that including harmful retrieved passages in prompts can drive safety-aligned LLMs to produce harmful content, highlighting a risk in retrieval-augmented generation pipelines. The work underscores the need for robust retriever safety mechanisms and informs safer deployment of retrieval systems in combination with large language models.

Abstract

Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.
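
The percentages quoted above (for example, 61.35% for LLM2Vec) describe how often a retriever selects the correct passage for a query. This summary does not spell out the exact metric, so the following is only a minimal sketch of one plausible reading: top-1 accuracy under cosine similarity. The function names, the toy two-dimensional vectors, and the gold labels are all hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_accuracy(query_vecs, passage_vecs, gold_ids):
    """Fraction of queries whose most similar passage is the gold passage.

    Assumption: each query has exactly one gold passage, identified by
    its index in passage_vecs.
    """
    hits = 0
    for q, gold in zip(query_vecs, gold_ids):
        # Index of the passage with the highest cosine similarity to q.
        best = max(range(len(passage_vecs)),
                   key=lambda i: cosine(q, passage_vecs[i]))
        hits += (best == gold)
    return hits / len(query_vecs)


# Toy embeddings (hypothetical); a real retriever would produce these.
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0], [0.2, 1.0]]
gold = [0, 1, 2, 1]

accuracy = top1_accuracy(queries, passages, gold)
print(f"top-1 accuracy: {accuracy:.2%}")
```

In this toy setup, three of the four queries rank their gold passage first, so the score is 75%. The paper's reported numbers would come from the retrievers' own embeddings and its AdvBench-IR corpus rather than anything shown here.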

Paper Structure

This paper contains 26 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Instruction-following retrievers can easily satisfy malicious requests. Top: Retrievers can select malicious content using fine-grained queries. Bottom: Retrieved malicious content can be fed to a safety aligned LLM which can use the content to answer the request.
  • Figure 2: Average passage rankings for fine-grained retrieval. Rank values can vary from zero to 100 (i.e., most to least similar).
  • Figure 3: Response harmfulness ($\downarrow$) for AdvBench-IR queries with varying numbers of in-context retrieved passages.
  • Figure 4: The distribution of queries across AdvBench-IR harm categories. Retriever performance on each category is provided in \ref{tab:retrieval_per_category}.
  • Figure 5: The prompt used for generating malicious passages for the retrieval corpus.
  • ...and 5 more figures