See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles

Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu

TL;DR

We construct a state control benchmark with binary toggle instructions derived from public datasets and propose State-aware Reasoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly.

Abstract

The advent of multimodal agents facilitates effective interaction with graphical user interfaces (GUIs), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions derived from public datasets. Evaluations of existing agents show that they are notably unreliable, particularly when the current toggle state already matches the desired state. To address this challenge, we propose State-aware Reasoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. Experiments on four multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public agentic benchmarks show that StaR also enhances general agentic task performance. Finally, evaluations in a dynamic environment highlight the potential of StaR for real-world applications. Code and benchmark: https://github.com/ZrW00/StaR.
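
The perceive, infer, act logic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names ToggleState, should_toggle, and classify_outcome are hypothetical, and the two hard parts (reading the current state from a screenshot and inferring the desired state from the instruction) are assumed to have been done already and are passed in as plain values. The sketch only shows the final decision and how the two error types fall out of it.

```python
from enum import Enum


class ToggleState(Enum):
    """Binary toggle state (hypothetical type, for illustration only)."""
    ON = "on"
    OFF = "off"


def should_toggle(current: ToggleState, desired: ToggleState) -> bool:
    """Decide whether to click: act only when the states differ."""
    return current != desired


def classify_outcome(current: ToggleState, desired: ToggleState,
                     agent_clicked: bool) -> str:
    """Label an agent's action against the state-aware decision.

    false_negative: states differ, but the agent did not toggle.
    false_positive: states already match, yet the agent toggled.
    """
    if agent_clicked == should_toggle(current, desired):
        return "correct"
    return "false_positive" if agent_clicked else "false_negative"


if __name__ == "__main__":
    # Instruction asks for ON while the toggle is already ON:
    # the right action is to leave it alone.
    print(should_toggle(ToggleState.ON, ToggleState.ON))
    print(classify_outcome(ToggleState.ON, ToggleState.ON, agent_clicked=True))
```

The case where the current state already matches the desired state is the one the abstract singles out as particularly unreliable: an agent that clicks whenever an instruction mentions a toggle will always fail it.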

Paper Structure

This paper contains 32 sections, 14 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: Two typical toggle errors, with representative toggle types below. (i) The desired state differs from the current state, but the agent fails to toggle (false negative); (ii) the desired state matches the current state, yet the agent still toggles (false positive). The bottom row shows representative toggle types: toggle button, switch, and checkbox.
  • Figure 2: Three-step annotation pipeline for constructing the state control benchmark. First, we extract screenshots with widget bounding boxes corresponding to toggle control instructions from public datasets and use OmniParser to parse clickable widgets. Second, we use Qwen-2-VL-72B and GLM-4V to identify toggles among the clickable widgets and establish inter-annotator agreement. Finally, we use Qwen-2-VL-72B and GLM-4V to annotate toggle state and functionality, ensuring data quality through inter-annotator agreement.
  • Figure 3: Agent performance on the state control benchmark (all metrics are standardized as "higher-is-better"). (a) Proprietary MLLM-based agents. (b) Open-source MLLM-based agents. (c) Open-source MLLM-based agents with prompt engineering. Results show that current agents remain unreliable for toggle control, and prompt engineering offers no fundamental improvement.
  • Figure 4: StaR reasoning chain. StaR simulates human-like reasoning for toggle control by incorporating state-aware reasoning into multimodal agents through three steps: (i) perceive current state, (ii) analyze desired state, and (iii) decide whether to toggle.
  • Figure 5: The performance of zero-shot and StaR-trained UI-TARS-7B on agentic benchmarks. Results demonstrate that StaR consistently preserves or enhances performance on agentic benchmarks and yields notable improvements on complex, long-chain tasks.
  • ...and 13 more figures