Attacking Vision-Language Computer Agents via Pop-ups

Yanzhe Zhang, Tao Yu, Diyi Yang

TL;DR

This work demonstrates that vision-language agents operating over GUIs are vulnerable to adversarial pop-ups that humans would readily recognize but that can mislead agents into clicking. The authors design pop-ups around four elements (Attention Hook, Instruction, Info Banner, ALT Descriptor) and integrate them into OSWorld and VisualWebArena, revealing high attack success rates and substantial declines in task success across multiple VLM backbones. Ablation studies show which components drive the attack (notably attention hooks and ALT descriptors) and reveal that simple defenses are insufficient, prompting exploration of step-wise defenses and broader mitigation strategies. The findings underscore real-world safety risks in autonomous GUI tasks and call for robust grounding, threat-model-aware training, and human-in-the-loop oversight to prevent malicious manipulation of automated agents.

Abstract

Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which require agents to understand the underlying interfaces. Although such visual inputs are becoming more integrated into agentic applications, the types of risks and attacks that exist around them remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing their tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
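
The attack success rate defined in the abstract (how often the agent clicks the pop-up) can be illustrated with a minimal sketch. It assumes one plausible reading of the metric: the fraction of all logged actions that are clicks landing inside the pop-up's bounding box. The `Action` and `Box` types, the field names, and the function `attack_success_rate` are hypothetical and are not taken from the paper or from the OSWorld or VisualWebArena code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Action:
    """One logged agent action (hypothetical schema, for illustration only)."""
    kind: str                                  # e.g. "click", "type", "scroll"
    xy: Optional[Tuple[float, float]] = None   # screen coordinates for clicks


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of the pop-up, in screen pixels."""
    left: float
    top: float
    width: float
    height: float

    def contains(self, xy: Tuple[float, float]) -> bool:
        x, y = xy
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


def attack_success_rate(actions: Sequence[Action], popup: Box) -> float:
    """Fraction of all actions that are clicks inside the pop-up box.

    Assumption: the denominator is every logged action, not only clicks.
    Returns 0.0 for an empty action log.
    """
    if not actions:
        return 0.0
    hits = sum(
        1 for a in actions
        if a.kind == "click" and a.xy is not None and popup.contains(a.xy)
    )
    return hits / len(actions)


# Usage: three logged actions, one of which clicks inside the pop-up.
popup = Box(left=0, top=0, width=100, height=50)
log = [
    Action("click", (10, 10)),     # inside the pop-up
    Action("click", (500, 500)),   # elsewhere on screen
    Action("type"),                # not a click
]
rate = attack_success_rate(log, popup)
```

Counting over all actions rather than only clicks is a design choice made here for simplicity; a different denominator would give different numbers for the same log.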

Paper Structure

This paper contains 26 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: On average, 92.7% / 73.1% of all actions of attacked agents in OSWorld/VisualWebArena are clicking on the adversarial pop-ups.
  • Figure 2: Adversarial pop-up examples. We highlight the design space of our pop-ups: (1) Attention Hook, (2) Instruction, (3) Info Banner, (4) ALT Descriptor (if the agent framework uses ALT strings in a11y trees).
  • Figure 3: The impact of our attack on how many steps the agent takes. We show the distribution of action steps w/ and w/o our attack, where the y-axis refers to the proportion of tasks. Our attack significantly delays task completion on both benchmarks, causing more tasks to stop only after reaching the step limit. Note that we show results for GPT-4-Turbo on OSWorld (with a 15-step limit) and GPT-4o on VisualWebArena (with a 30-step limit).
  • Figure 4: The correlation between ASR and Task-level Attack Success Rate (TASR) shows that TASR is generally similar to ASR and tends to be higher than ASR when ASR is low.
  • Figure 5: Successfully attacked examples, showing the thoughts generated by the original and attacked agents. Examples 1, 2, and 3 are from the OSWorld screen agent, the OSWorld SoM agent, and the VisualWebArena SoM agent, respectively.
  • ...and 5 more figures