Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study

Li Zhang, Longxi Gao, Mengwei Xu

TL;DR

This study empirically evaluates whether chain-of-thought reasoning improves the performance of commercial vision-language models (VLMs) when controlling mobile GUIs. It compares two model pairs (Gemini 2.0 Flash and Claude 3.7 Sonnet), each in base and reasoning-enabled form, against GPT-4o on the static benchmarks ScreenSpot and AndroidControl and in the interactive AndroidWorld environment. Reasoning yields only marginal gains on static tasks and model-specific improvements on AndroidWorld, with substantial inconsistencies and higher token costs. The analysis highlights benchmark limitations and grounding and consistency challenges in reasoning VLMs, and argues that interactive evaluation and adaptive reasoning are needed to make these capabilities practical for mobile GUI agents. The work offers guidance on benchmark design, model training for grounding, and efficiency-aware reasoning for future GUI automation systems.

Abstract

Reasoning capabilities have significantly improved the performance of vision-language models (VLMs) in domains such as mathematical problem-solving, coding, and visual question-answering. However, their impact on real-world applications remains unclear. This paper presents the first empirical study on the effectiveness of reasoning-enabled VLMs in mobile GUI agents, a domain that requires interpreting complex screen layouts, understanding user instructions, and executing multi-turn interactions. We evaluate two pairs of commercial models (Gemini 2.0 Flash and Claude 3.7 Sonnet), comparing their base and reasoning-enhanced versions across two static benchmarks (ScreenSpot and AndroidControl) and one interactive environment (AndroidWorld). Surprisingly, we find that the Claude 3.7 Sonnet reasoning model achieves state-of-the-art performance on AndroidWorld. However, reasoning VLMs generally offer only marginal improvements over non-reasoning models on static benchmarks and even degrade performance in some agent setups. Notably, reasoning and non-reasoning VLMs fail on different sets of tasks, suggesting that reasoning does have an impact, but that its benefits and drawbacks counterbalance each other. We attribute these inconsistencies to the limitations of both the benchmarks and the VLMs. Based on these findings, we provide insights for further enhancing mobile GUI agents in terms of benchmarks, VLMs, and agents' adaptability in dynamically invoking reasoning VLMs. The experimental data are publicly available at https://github.com/LlamaTouch/VLM-Reasoning-Traces.
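
The observation that reasoning and non-reasoning VLMs fail on different sets of tasks can be illustrated with a small set comparison. The following is a minimal sketch, not the authors' analysis code: the function name `compare_failures`, the task ids, and the outcomes are all hypothetical, and it assumes each model's results are available as a mapping from task id to a success flag. It shows how two models can have the same completion rate while failing on disjoint tasks.

```python
def compare_failures(base_results, reasoning_results):
    """Split failed tasks by which model failed them.

    Both arguments map a task id to True (completed) or False (failed).
    Only tasks present in both mappings are compared.
    """
    shared = base_results.keys() & reasoning_results.keys()
    base_failed = {t for t in shared if not base_results[t]}
    reasoning_failed = {t for t in shared if not reasoning_results[t]}
    return {
        "only_base_failed": base_failed - reasoning_failed,
        "only_reasoning_failed": reasoning_failed - base_failed,
        "both_failed": base_failed & reasoning_failed,
    }


# Hypothetical outcomes for illustration only; not data from the paper.
base = {"t1": True, "t2": False, "t3": True, "t4": False}
reasoning = {"t1": False, "t2": True, "t3": True, "t4": False}

diff = compare_failures(base, reasoning)
for name, tasks in diff.items():
    print(name, sorted(tasks))
```

In this toy input both models complete two of four tasks, yet each fails one task the other solves, which is the kind of pattern an aggregate success rate alone would hide.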

Paper Structure

This paper contains 14 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: A demonstration of Gemini 2.0 Flash Thinking's reasoning process for mobile GUI automation tasks. The model first explicitly outlines the task instruction and the observed GUI elements, then reasons through the information to determine the actions. User request to the mobile GUI agent: You need to complete the task "Set my DM Spam filter to 'Do not filter direct messages' on Discord", output possible actions on this GUI that may complete the task. Left: The input mobile GUI (screenshot). Right: VLM's reasoning process and final response (action).
  • Figure 2: Task completion rates on AndroidWorld, categorized by task difficulty.
  • Figure 3: An example of a grounding error on ScreenSpot.
  • Figure 4: Comparison of average output token count between the Claude reasoning model and its base model without reasoning. Across all setups, reasoning increases token consumption by at least 3$\times$ compared to the non-reasoning model, resulting in higher monetary costs and increased response latency.
  • Figure 5: Benchmark Error: Weak Evaluation Method.
  • ...and 7 more figures