Hidden in the Metadata: Stealth Poisoning Attacks on Multimodal Retrieval-Augmented Generation

Kennedy Edemacu, Mohammad Mahdi Shokri

TL;DR

MM-MEPA is a multimodal poisoning attack that targets the metadata components of image-text entries while leaving the associated visual content unaltered. It exposes a critical vulnerability in multimodal RAG and underscores the urgent need for more robust, defense-aware retrieval and knowledge integration methods.

Abstract

Retrieval-augmented generation (RAG) has emerged as a powerful paradigm for enhancing multimodal large language models by grounding their responses in external, factual knowledge and thus mitigating hallucinations. However, integrating externally sourced knowledge bases introduces a critical attack surface: adversaries can inject malicious multimodal content capable of influencing both retrieval and downstream generation. In this work, we present MM-MEPA, a multimodal poisoning attack that targets the metadata components of image-text entries while leaving the associated visual content unaltered. By manipulating only the metadata, MM-MEPA can still steer multimodal retrieval and induce attacker-desired model responses. We evaluate the attack across multiple benchmark settings and demonstrate its severity. MM-MEPA achieves an attack success rate of up to 91%, consistently disrupting system behavior across four retrievers and two multimodal generators. Additionally, we assess representative defense strategies and find them largely ineffective against this form of metadata-only poisoning. Our findings expose a critical vulnerability in multimodal RAG and underscore the urgent need for more robust, defense-aware retrieval and knowledge integration methods.

Paper Structure

This paper contains 32 sections, 12 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of the MM-MEPA attack flow on a multimodal RAG (MM-RAG) pipeline. MM-MEPA injects carefully crafted metadata into the knowledge base (KB), influencing both the retrieval and downstream generation phases.
  • Figure 2: We compute cosine similarity between the gold image and (i) its clean caption and (ii) the injected poisoned caption. Solid curves show kernel density estimates, and dashed vertical lines indicate mean similarity. Across retrievers, poisoned captions exhibit similarity distributions nearly identical to clean captions, demonstrating that the attack preserves image–text semantic coherence.
  • Figure 3: Mean image–text cosine similarity for gold (non-poisoned) captions across retrievers and datasets. The red dashed line indicates the detection threshold of 0.2.
  • Figure 4: We compare cosine similarity between gold images and their clean versus poisoned captions across four retrievers. The high overlap between distributions indicates that poisoned captions remain semantically aligned with the associated images, suggesting that retrieval failures are not due to degraded image–text compatibility.
  • Figure 5: Illustrative example of a MEPA metadata poisoning attack. The left panel shows the image pool associated with the query. The right panel shows the text pool, where the injected poisoned caption (in red) introduces incorrect but plausible information. Despite visually grounded evidence, the model retrieves and incorporates the poisoned caption, producing an incorrect answer.
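
The captions for Figures 2 to 4 refer to an image–text cosine-similarity check with a detection threshold of 0.2. The following is a minimal sketch of what such a check might look like, not the paper's implementation. The function names and the three-dimensional toy vectors are hypothetical stand-ins for real retriever embeddings, chosen only to illustrate why a poisoned caption that stays semantically aligned with its image would pass a threshold filter of this kind.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def passes_filter(image_emb, caption_emb, threshold=0.2):
    """Hypothetical defense: keep an entry only if its caption embedding
    is at least `threshold`-similar to its image embedding.
    The 0.2 default mirrors the threshold named in the Figure 3 caption."""
    return cosine_similarity(image_emb, caption_emb) >= threshold


# Toy embeddings (made up for illustration; real ones would come from
# a multimodal retriever's image and text encoders).
image = np.array([1.0, 0.0, 0.0])
clean_caption = np.array([0.6, 0.8, 0.0])     # aligned with the image
poisoned_caption = np.array([0.6, 0.0, 0.8])  # equally aligned, different content
unrelated_caption = np.array([0.0, 1.0, 0.0])  # orthogonal to the image

for name, emb in [
    ("clean", clean_caption),
    ("poisoned", poisoned_caption),
    ("unrelated", unrelated_caption),
]:
    sim = cosine_similarity(image, emb)
    print(f"{name}: similarity={sim:.2f}, kept={passes_filter(image, emb)}")
```

In this toy setup the clean and poisoned captions have the same similarity to the image, so the filter keeps both and rejects only the unrelated caption. This matches the qualitative point of the figure captions: a threshold on image–text similarity cannot separate captions whose similarity distributions overlap.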