SlowBA: An efficiency backdoor attack towards VLM-based GUI agents

Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu

TL;DR

SlowBA is a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. It uses a two-stage reward-level backdoor injection (RBI) strategy that first aligns the long-response format and then learns trigger-aware activation through reinforcement learning.

Abstract

Modern vision-language-model (VLM) based graphical user interface (GUI) agents are expected not only to execute actions accurately but also to respond to user instructions with low latency. While existing research on GUI-agent security mainly focuses on manipulating action correctness, the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains under specific trigger patterns. To achieve this, we propose a two-stage reward-level backdoor injection (RBI) strategy that first aligns the long-response format and then learns trigger-aware activation through reinforcement learning. In addition, we design realistic pop-up windows as triggers that naturally appear in GUI environments, improving the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA can significantly increase response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code is available at https://github.com/tu-tuing/SlowBA.

Paper Structure

This paper contains 30 sections, 9 equations, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: (left) The scenario of the proposed attack. Unlike traditional backdoor attacks, which aim to manipulate the accuracy of actions, we target the responsiveness of agents. Specifically, we want them to generate responses with very high latency. (right) Comparison of SlowBA with other backdoor attacks against VLMs and VLM-based GUI agents.
  • Figure 2: Correlation between latency and length.
  • Figure 3: The workflow of SlowBA. Red solid rectangular boxes mark the positions of triggers. $\mathcal{D}_{tr}^{S}$ and $\mathcal{D}_{tr}^{R}$ denote the parts of the triggered dataset used for SFT and RL, respectively.
  • Figure 4: Trigger construction process. The left part illustrates the prompt used for extracting domain names (only used for websites) and the tools for trigger construction. The right part shows the trigger applied to website pages (top-left), desktop pages (top-right), and app pages (bottom).
  • Figure 5: Average token lengths.
  • ...and 8 more figures