One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks

TL;DR

The paper reveals a critical vulnerability in visual document RAG systems: a single adversarial image injected into the knowledge base can both be retrieved and steer generation, enabling targeted disinformation or denial of service (DoS). It introduces MO-PGD, a multi-objective gradient attack, to jointly optimize retrieval and generation objectives under white-box and black-box settings, and evaluates targeted and universal attacks across multiple VD-RAG configurations. Key findings show CLIP-L is particularly susceptible, enabling both retrieval and verbatim malicious outputs, while ColPali and GME exhibit robustness in universal settings but remain vulnerable to targeted attacks. Defenses (knowledge expansion, VLM-as-a-judge, and query paraphrasing) offer limited robustness, especially against adaptive attackers, highlighting the need for modality-aware defenses and more robust VD-RAG designs with practical safeguards.

Abstract

Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets and a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrates robustness to black-box attacks in the universal setting.
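
To make the idea of a multi-objective gradient-based attack concrete, the following is a minimal sketch of a projected-gradient loop that descends a weighted sum of a retrieval loss and a generation loss while keeping the perturbation inside an L-infinity ball. It is not the authors' implementation: the function `mo_pgd`, the weight `lam`, the budget `eps`, the step size, and both toy surrogate losses are hypothetical stand-ins chosen so that the example runs without any retriever or generator.

```python
import numpy as np

def mo_pgd(image, grad_ret, grad_gen, lam=0.5, eps=8 / 255, step=1 / 255, n_steps=40):
    """Toy multi-objective PGD (illustrative assumption, not the paper's code).

    Minimises lam * L_ret + (1 - lam) * L_gen over pixels in [0, 1],
    subject to an L-infinity budget `eps` around the original image.
    """
    original = image.copy()
    adv = image.copy()
    for _ in range(n_steps):
        # Combine the gradients of the two objectives with a fixed weight.
        grad = lam * grad_ret(adv) + (1.0 - lam) * grad_gen(adv)
        # Signed gradient descent step.
        adv = adv - step * np.sign(grad)
        # Project back into the eps-ball and the valid pixel range.
        adv = np.clip(adv, original - eps, original + eps)
        adv = np.clip(adv, 0.0, 1.0)
    return adv

# Hypothetical surrogates: a linear "retrieval" score against a query direction
# and a squared distance to a pattern the "generator" is assumed to react to.
query_dir = np.array([1.0, -1.0] * 8)
target = np.ones(16)

def loss_ret(x):
    return -float(query_dir @ x)

def grad_ret(x):
    return -query_dir

def loss_gen(x):
    return 0.5 * float(np.sum((x - target) ** 2))

def grad_gen(x):
    return x - target

def total_loss(x, lam=0.5):
    return lam * loss_ret(x) + (1.0 - lam) * loss_gen(x)

clean = np.full(16, 0.5)
poisoned = mo_pgd(clean, grad_ret, grad_gen)
```

In a real pipeline the two gradient callbacks would come from backpropagating through the embedding model and the vision language model; the fixed weighting used here is only one of several ways to balance the two objectives.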

Paper Structure

This paper contains 42 sections, 5 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Overview of the white-box attack. We select an arbitrary document/image $I$ and optimize it against target queries $Q^+$ in the training set (left). The resulting poisoned document $I'$ is then injected into the KB. When the attack is successful, $I'$ is retrieved and causes the generator $\mathcal{G}$ to malfunction (right).
  • Figure 2: An example of a benign image from the ViDoRe-V1-AI Dataset (left) and its adversarially perturbed counterpart (right). Universal White-box Attack against CLIP-ViT-LARGE, SmolVLM-Instruct, with perturbation intensity $\alpha=\frac{8}{255}$. Result: ASR-R $=1$, ASR-G$_\text{Sim}$ $=1$.
  • Figure 3: An example of a benign image from the ViDoRe-V2-ESG Dataset (left) and its adversarially perturbed counterpart (right). Universal White-box Attack against CLIP-ViT-LARGE, SmolVLM-Instruct, with perturbation intensity $\alpha=\frac{8}{255}$. Result: ASR-R $=0.82$, ASR-G$_\text{Sim}$ $=1$.
  • Figure 4: Two examples of successful malicious targeted prompt-based attacks (setting I) generated by (a) GPT-5 and (b) Gemini-2.5-Flash, applied to GME-Qwen2-VL-2B and SmolVLM-Instruct.
  • Figure 5: Two examples of successful malicious targeted prompt-based attacks (setting III) generated by (a) GPT-5 and (b) Gemini-2.5-Flash, applied to ColPali-v1.3 and Qwen2.5-VL-3B-Instruct.
  • ...and 3 more figures
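
The captions above report ASR-R values without defining the metric in this listing. As a rough illustration only, the sketch below assumes ASR-R is the fraction of evaluation queries for which the poisoned image appears among the retriever's top-k results, and it assumes single-vector embeddings compared by cosine similarity (multi-vector retrievers such as ColPali score documents differently). The function name `retrieval_success_rate` and the toy embeddings are hypothetical.

```python
import numpy as np

def retrieval_success_rate(query_embs, doc_embs, poisoned_idx, k=1):
    """Fraction of queries whose top-k documents include the poisoned one.

    Assumed reading of ASR-R; cosine similarity is an illustrative choice.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T
    # Indices of the k most similar documents for every query.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == poisoned_idx).any(axis=1)
    return float(hits.mean())

# Toy knowledge base: two benign documents and one poisoned document (index 2).
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

rate_top1 = retrieval_success_rate(queries, docs, poisoned_idx=2, k=1)
rate_top2 = retrieval_success_rate(queries, docs, poisoned_idx=2, k=2)
```

In this toy setup the poisoned embedding sits between the two benign ones, so it is the best match for only one of the three queries but falls within the top two for all of them.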