VENOMREC: Cross-Modal Interactive Poisoning for Targeted Promotion in Multimodal LLM Recommender Systems

Guowei Guan, Yurong Hao, Jiaming Zhang, Tiantong Wu, Fuyao Zhang, Tianxiang Chen, Longtao Huang, Cyril Leung, Wei Yang Bryan Lim

TL;DR

The paper studies the security of multimodal LLM-based recommender systems and identifies a cross-modal poisoning vulnerability that arises when an attacker coordinates perturbations across text and image inputs. It introduces VenomRec, a two-stage attack: Exposure Alignment identifies a high-exposure semantic hotspot, and Cross-modal Interactive Perturbation uses cross-modal attention to co-adapt text and visuals toward that hotspot. In experiments on three real-world datasets, VenomRec achieves a mean ER@20 of about 0.73, which is 0.52 absolute points above the strongest baseline on average, while preserving benign recommendation utility and remaining effective in zero-shot settings; ablations show that the interactive co-adaptation loop is critical. The work highlights a security paradox: cross-modal consensus improves robustness to unimodal noise, yet it can be exploited to amplify coordinated malicious signals, so defenses need to account for cross-modal adversarial alignment and semantic steering.

Abstract

Multimodal large language models (MLLMs) are pushing recommender systems (RecSys) toward content-grounded retrieval and ranking via cross-modal fusion. We find that while cross-modal consensus often mitigates conventional poisoning that manipulates interaction logs or perturbs a single modality, it also introduces a new attack surface where synchronised multimodal poisoning can reliably steer fused representations along stable semantic directions during fine-tuning. To characterise this threat, we formalise cross-modal interactive poisoning and propose VENOMREC, which performs Exposure Alignment to identify high-exposure regions in the joint embedding space and Cross-modal Interactive Perturbation to craft attention-guided coupled token-patch edits. Experiments on three real-world multimodal datasets demonstrate that VENOMREC consistently outperforms strong baselines, achieving 0.73 mean ER@20 and improving over the strongest baseline by +0.52 absolute ER points on average, while maintaining comparable recommendation utility.
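
The headline metric ER@20 is not defined in this excerpt. The sketch below assumes the reading common in promotion-attack work: ER@K is the fraction of users whose top-K recommendation list contains the promoted target item. The function name `exposure_rate_at_k` and the toy rankings are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def exposure_rate_at_k(
    ranked_lists: Sequence[Sequence[int]], target_item: int, k: int
) -> float:
    """Fraction of users whose top-k list contains target_item.

    ASSUMPTION: this is the usual definition of ER@K; the excerpt
    itself does not spell the metric out.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not ranked_lists:
        return 0.0
    # Count users for whom the target appears among the first k items.
    hits = sum(1 for ranked in ranked_lists if target_item in ranked[:k])
    return hits / len(ranked_lists)


# Toy data (hypothetical): ranked item IDs for four users; item 7 is the target.
toy_rankings = [
    [7, 2, 3],
    [1, 7, 4],
    [5, 6, 7],
    [8, 9, 1],
]
er_at_2 = exposure_rate_at_k(toy_rankings, target_item=7, k=2)
er_at_3 = exposure_rate_at_k(toy_rankings, target_item=7, k=3)
```

Under this reading, a mean ER@20 of 0.73 would mean the target item reaches the top-20 list of roughly 73% of evaluated users, averaged over the three datasets.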

Paper Structure

This paper contains 39 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of poisoning paradigms on MLLM-RecSys. Top: Interaction-level attacks manipulate discrete user-item records (e.g., clicks/ratings) but have limited targeted influence against MLLM-RecSys because they do not directly steer the model's semantic reasoning process. Middle: Single-modality attacks perturb either text or image in isolation and are often mitigated by cross-modal consensus in the fusion mechanism, where the unperturbed modality anchors the final semantics. Bottom: Our VenomRec exploits MLLM-RecSys by crafting coupled cross-modal perturbations guided by cross-modal attention.
  • Figure 2: The schematic overview of VenomRec. The framework operates in two strategic phases: (1) Exposure Alignment (EA) constructs a latent target centroid from high-exposure anchors to guide the attack direction. (2) Cross-modal Interactive Perturbation (CIP) leverages attention-guided saliency to optimise visual features and textual tokens iteratively.
  • Figure 3: Impact of the compromised user ratio $\rho$ on attack effectiveness (ER@5) and recommendation utility (HR@5, NDCG@5) under Few-Shot and Zero-Shot settings on the Clothing dataset.
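
The Figure 2 caption says that Exposure Alignment builds a latent target centroid from high-exposure anchors and that this centroid guides the attack direction. A minimal sketch of that idea follows. How the paper selects anchors, aggregates them, and scores alignment is not given in this excerpt, so the mean-of-unit-vectors centroid, the cosine alignment score, and all names (`target_centroid`, `cosine`, the toy 2-D embeddings) are assumptions for illustration only.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def target_centroid(anchors: Sequence[Sequence[float]]) -> Vector:
    """ASSUMED aggregation: mean of unit-normalised anchors, re-normalised."""
    if not anchors:
        raise ValueError("need at least one anchor embedding")
    unit = [l2_normalize(a) for a in anchors]
    dim = len(unit[0])
    mean = [sum(u[i] for u in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as a hypothetical alignment score."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


# Toy 2-D embeddings (hypothetical) standing in for fused item representations.
anchors = [[1.0, 0.0], [0.0, 1.0]]       # embeddings of high-exposure items
centroid = target_centroid(anchors)       # points along the diagonal

before = cosine([1.0, 0.0], centroid)     # target item before any perturbation
after = cosine([1.0, 1.0], centroid)      # target item after a perturbation toward the centroid
```

In this toy setting, a perturbation that moves the target item's embedding toward the centroid raises the alignment score (`after > before`). Per the caption, the second phase (CIP) would optimise visual features and textual tokens iteratively toward such a direction, which this sketch does not attempt to reproduce.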

Theorems & Definitions (1)

  • Definition 3.1: Cross-modal Interactive Poisoning