Cross-Modal Content Optimization for Steering Web Agent Preferences

Tanqiu Jiang, Min Bai, Nikolaos Pappas, Yanjun Qi, Sandesh Swamy

TL;DR

Cross-Modal Preference Steering (CPS) demonstrates that black-box attackers can jointly manipulate visual and textual inputs to steer VLM-based web agents toward target items. By combining ensemble CLIP-based visual perturbations with iterative textual refinements exploiting RLHF biases, CPS achieves high manipulation rates while maintaining stealth against detectors. The approach is validated on movie recommendation and shopping tasks across multiple backbones (GPT-4.1, Qwen-2.5VL, Pixtral-Large), revealing significant defense gaps as detection remains challenging. The work underscores the need for robust, multi-modal defenses and architectural safeguards to ensure trustworthy, high-stakes agent decisions in real-world web ecosystems.

Abstract

Vision-language model (VLM)-based web agents increasingly power high-stakes selection tasks like content recommendation or product ranking by combining multimodal perception with preference reasoning. Recent studies reveal that these agents are vulnerable to attackers who can bias selection outcomes through preference manipulations using adversarial pop-ups, image perturbations, or content tweaks. Existing work, however, either assumes strong white-box access, is limited to single-modal perturbations, or relies on impractical settings. In this paper, we demonstrate, for the first time, that joint exploitation of visual and textual channels yields significantly more powerful preference manipulations under realistic attacker capabilities. We introduce Cross-Modal Preference Steering (CPS), which jointly optimizes imperceptible modifications to an item's visual and natural language descriptions, exploiting CLIP-transferable image perturbations and RLHF-induced linguistic biases to steer agent decisions. In contrast to prior studies that assume gradient access or control over webpages or agent memory, we adopt a realistic black-box threat setup: a non-privileged adversary can edit only their own listing's images and textual metadata, with no insight into the agent's model internals. We evaluate CPS on agents powered by state-of-the-art proprietary and open-source VLMs, including GPT-4.1, Qwen-2.5VL, and Pixtral-Large, on both movie selection and e-commerce tasks. Our results show that CPS is significantly more effective than leading baseline methods: it consistently outperforms baselines across all models while maintaining 70% lower detection rates, demonstrating both effectiveness and stealth. These findings highlight an urgent need for robust defenses as agentic systems play an increasingly consequential role in society.
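To make the idea of a bounded, transferable image perturbation concrete, here is a minimal toy sketch of projected gradient descent (PGD) against an ensemble of surrogate encoders. It is not the authors' implementation. The random linear maps standing in for CLIP encoders, the cosine-similarity objective, the L-infinity budget `eps`, the step size, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_grad(z, t):
    """Gradient of cosine(z, t) with respect to z."""
    nz, nt = np.linalg.norm(z), np.linalg.norm(t)
    return t / (nz * nt) - (z @ t) * z / (nz ** 3 * nt)

def ensemble_pgd(x, encoders, targets, eps=8 / 255, step=1 / 255, iters=50):
    """Sign-gradient ascent on the summed cosine similarity of an ensemble.

    x        : flattened image with values in [0, 1]
    encoders : list of matrices W (toy linear stand-ins for image encoders)
    targets  : list of target embeddings, one per encoder
    eps      : assumed L-infinity perturbation budget
    """
    x_adv = x.copy()
    for _ in range(iters):
        grad = np.zeros_like(x)
        for W, t in zip(encoders, targets):
            # Chain rule through the linear encoder z = W @ x_adv.
            grad += W.T @ cosine_grad(W @ x_adv, t)
        x_adv = x_adv + step * np.sign(grad)
        # Project back into the eps-ball around x, then into the pixel range.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

def mean_similarity(x, encoders, targets):
    """Average cosine similarity to the targets across the ensemble."""
    return float(np.mean([cosine(W @ x, t) for W, t in zip(encoders, targets)]))

# Toy data: a 64-"pixel" image and three random 16-dimensional encoders.
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=64)
encoders = [rng.normal(size=(16, 64)) for _ in range(3)]
targets = [rng.normal(size=16) for _ in range(3)]

x_adv = ensemble_pgd(x, encoders, targets)
before = mean_similarity(x, encoders, targets)
after = mean_similarity(x_adv, encoders, targets)
print(f"mean similarity before: {before:.3f}, after: {after:.3f}")
print(f"max pixel change: {np.max(np.abs(x_adv - x)):.4f}")
```

Averaging gradients over several surrogates is one common way to encourage a perturbation to transfer to an unseen model; whether CPS uses this exact objective or budget is not stated in the abstract.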

Paper Structure

This paper contains 44 sections, 14 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: CPS attack on web agent selection. Top: benign scenario with random selection. Bottom: adversary jointly perturbs thumbnails and text to steer the agent toward selecting the targeted item.
  • Figure 2: Effect of PGD adversarial perturbation on image understanding. Top: original (left) and PGD-perturbed (right) images. Bottom: outputs from a black-box vision–language model (GPT-4o). On the clean image, the model correctly identifies a partially eaten apple and a flip phone; after PGD, the model misperceives the fruit as an orange, illustrating semantically consequential drift induced by the perturbation.
  • Figure 3: Iterative refinement of an item's text until it successfully attracts the web agent's preference.
  • Figure 4: Agent input structure. The raw UI (top) and OmniParser-labeled view (middle) are vertically stacked and strictly size-capped. Bottom: a small, curated subset of parsed elements (4 early IDs, then an ellipsis, then 4 later IDs), showing type, interactivity, normalized bounding boxes, and raw content snippets.
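
The curated subset described for Figure 4 (4 early IDs, an ellipsis, then 4 later IDs) can be sketched roughly as below. The `ParsedElement` class, its field names, the line format, and the 40-character snippet cap are hypothetical; they mirror the attributes listed in the caption (type, interactivity, normalized bounding box, raw content) and are not OmniParser's actual schema or the authors' code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ParsedElement:
    """One parsed UI element (hypothetical field names)."""
    id: int
    type: str                                 # e.g. "text" or "icon"
    interactive: bool
    bbox: Tuple[float, float, float, float]   # normalized (x0, y0, x1, y1)
    content: str

def format_element(e: ParsedElement, max_chars: int = 40) -> str:
    """Render one element as a single line with a truncated content snippet."""
    x0, y0, x1, y1 = e.bbox
    snippet = e.content[:max_chars]
    return (f"[{e.id}] type={e.type} interactive={e.interactive} "
            f"bbox=({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}) content={snippet!r}")

def curated_subset(elements: List[ParsedElement], head: int = 4, tail: int = 4) -> List[str]:
    """Keep the first `head` and last `tail` elements, separated by an ellipsis line."""
    if len(elements) <= head + tail:
        return [format_element(e) for e in elements]
    first = [format_element(e) for e in elements[:head]]
    last = [format_element(e) for e in elements[len(elements) - tail:]]
    return first + ["..."] + last

# Toy example with 12 parsed elements.
elements = [
    ParsedElement(
        id=i,
        type="text" if i % 2 else "icon",
        interactive=(i % 2 == 0),
        bbox=(0.10, 0.05 * i, 0.30, 0.05 * i + 0.04),
        content=f"item {i}",
    )
    for i in range(12)
]

lines = curated_subset(elements)
print("\n".join(lines))
```

Showing only the head and tail of the element list keeps the text portion of the agent input short while still indicating how many IDs the parser produced.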