Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions

Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao

TL;DR

The paper investigates whether multimodal GUI agents remain faithful to user goals when exposed to non-malicious yet distracting environmental content. It introduces a distraction-rich dataset with four scenarios and three perception patterns, and evaluates ten MLLMs, revealing widespread susceptibility to distractions that affects both faithfulness and usefulness. A risk-oriented adversarial perspective demonstrates the feasibility of environmental injection, while an input-channel alignment approach (DPO) shows partial improvements in faithfulness. The study highlights the critical need for faithfulness-focused strategies and for future work on visual-semantic grounding and self-correction to enable reliable real-world deployment of multimodal GUI agents.

Abstract

This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using a simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness of agents, our findings first indicate that these agents are prone to environmental distractions. Furthermore, we implement an adversarial environment injection and analyze the approach to improve faithfulness, calling for a collective focus on this important topic.

Paper Structure

This paper contains 31 sections, 11 equations, 4 figures, 9 tables, and 1 algorithm.

Figures (4)

  • Figure 1: (a) Previous studies expect agents to work normally and improve the action prediction performance (e.g., yang2023appagent, zhang2023you). (b) Recent works have discussed that agents can be influenced by ambiguous instructions or malicious inputs (e.g., ruan2024identifying). (c) We focus on the distractions from the environment. The agent is affected when it is perceiving the environment. These distractions (e.g., coupons) are irrelevant to the user's goal and can mislead the agent's action prediction.
  • Figure 2: Overview of our work for distracting GUI agents. We first construct environment status with distractions (the left part), then implement working patterns with prompts (the middle part), and evaluate a broad range of multimodal agents, judging each predicted action as gold, distracted, or invalid (the right part).
  • Figure 3: Examples of simulated data.
  • Figure 4: Illustration of scenario features.
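
The three-way judgment named in the Figure 2 caption (gold, distracted, invalid) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the exact-match rule, the action strings, and the names `judge_action` and `label_rates` are all assumptions made for illustration, and the paper's actual matching criteria are not given in this summary.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable

LABELS = ("gold", "distracted", "invalid")


def judge_action(predicted: str, gold: set[str], distracting: set[str]) -> str:
    """Label one predicted action (hypothetical exact-match rule).

    gold:        actions that follow the user's goal
    distracting: actions that follow the environmental distraction
    Anything else is treated as invalid.
    """
    if predicted in gold:
        return "gold"
    if predicted in distracting:
        return "distracted"
    return "invalid"


def label_rates(labels: Iterable[str]) -> dict[str, float]:
    """Return the fraction of samples that received each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in LABELS}


# Toy samples (invented for illustration, not taken from the dataset).
samples = [
    ("click search_box", {"click search_box"}, {"click coupon_banner"}),
    ("click coupon_banner", {"click search_box"}, {"click coupon_banner"}),
    ("click add_to_cart", {"click add_to_cart"}, {"click popup_accept"}),
    ("scroll sideways", {"click add_to_cart"}, {"click popup_accept"}),
]

labels = [judge_action(pred, gold, bad) for pred, gold, bad in samples]
rates = label_rates(labels)
print(labels)
print(rates)
```

Separating the distracted and invalid labels keeps the two failure modes apart: a distracted action shows the agent followed the environment rather than the user, while an invalid action matches neither.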