Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study
Li Zhang, Longxi Gao, Mengwei Xu
TL;DR
This study empirically evaluates whether chain-of-thought reasoning improves the performance of commercial vision-language models (VLMs) acting as mobile GUI agents. It compares the base and reasoning-enabled forms of two models (Gemini 2.0 Flash and Claude 3.7 Sonnet), against GPT-4o, on the static benchmarks ScreenSpot and AndroidControl and in the interactive AndroidWorld environment. Reasoning yields only marginal gains on static tasks and model-specific improvements on AndroidWorld, and it comes with inconsistent results and higher token costs. The analysis traces these inconsistencies to benchmark limitations and to grounding and consistency problems in reasoning VLMs, and it argues for interactive evaluation and adaptive reasoning to make such capabilities practical for mobile GUI agents. The work offers guidance on benchmark design, model training for grounding, and efficiency-aware reasoning for future GUI automation systems.
Abstract
Reasoning capabilities have significantly improved the performance of vision-language models (VLMs) in domains such as mathematical problem-solving, coding, and visual question-answering. However, their impact on real-world applications remains unclear. This paper presents the first empirical study on the effectiveness of reasoning-enabled VLMs in mobile GUI agents, a domain that requires interpreting complex screen layouts, understanding user instructions, and executing multi-turn interactions. We evaluate two pairs of commercial models, Gemini 2.0 Flash and Claude 3.7 Sonnet, comparing the base and reasoning-enhanced version of each across two static benchmarks (ScreenSpot and AndroidControl) and one interactive environment (AndroidWorld). Surprisingly, the Claude 3.7 Sonnet reasoning model achieves state-of-the-art performance on AndroidWorld. However, reasoning VLMs generally offer only marginal improvements over non-reasoning models on static benchmarks and even degrade performance in some agent setups. Notably, reasoning and non-reasoning VLMs fail on different sets of tasks, suggesting that reasoning does have an impact, but that its benefits and drawbacks counterbalance each other. We attribute these inconsistencies to the limitations of both the benchmarks and the VLMs. Based on these findings, we provide insights for further enhancing mobile GUI agents in terms of benchmarks, VLMs, and the ability of agents to invoke reasoning VLMs dynamically. The experimental data are publicly available at https://github.com/LlamaTouch/VLM-Reasoning-Traces.
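The observation that reasoning and non-reasoning VLMs fail on different sets of tasks can be checked with a simple set comparison over per-task outcomes. The sketch below is only an illustration of that idea, not the authors' analysis code: the function name `compare_failures`, the result format (task ID mapped to a success flag), and the task IDs are all hypothetical, and the outcomes are placeholders rather than data from the paper.

```python
def compare_failures(base_results, reasoning_results):
    """Split the tasks both variants attempted by which variant failed them.

    Each argument maps a task ID to True (success) or False (failure).
    This format is an assumption made for illustration.
    """
    shared = base_results.keys() & reasoning_results.keys()
    base_fail = {t for t in shared if not base_results[t]}
    reasoning_fail = {t for t in shared if not reasoning_results[t]}
    both = base_fail & reasoning_fail
    union = base_fail | reasoning_fail
    return {
        "only_base_fails": sorted(base_fail - reasoning_fail),
        "only_reasoning_fails": sorted(reasoning_fail - base_fail),
        "both_fail": sorted(both),
        # Jaccard overlap of the two failure sets: 1.0 means identical
        # failures, values near 0 mean the variants fail on different tasks.
        "jaccard_overlap": len(both) / len(union) if union else 1.0,
    }


# Placeholder outcomes (True = success); NOT results from the paper.
base = {"task_a": True, "task_b": False, "task_c": False, "task_d": True}
reasoning = {"task_a": False, "task_b": False, "task_c": True, "task_d": True}

summary = compare_failures(base, reasoning)
print(summary)
```

With outcomes like these, both variants have the same overall success rate, yet the failure sets overlap only partially. That is the pattern an aggregate accuracy number hides and a per-task comparison exposes.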
