Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents
Lingzhong Dong, Ziqi Zhou, Shuaibo Yang, Haiyue Sheng, Pengzhou Cheng, Zongru Wu, Zheng Wu, Gongshen Liu, Zhuosheng Zhang
TL;DR
This work introduces Ground-Truth Alignment (GTA), a principled metric to diagnose whether a vision-language model-powered mobile-use agent's chain-of-thought (CoT) truly implies the ground-truth action. By combining GTA with Exact Match (EM), the authors establish a four-quadrant framework to separate reasoning accuracy from execution accuracy and reveal two key failure modes: Execution Gap ($\text{EG}$) and Reasoning Gap ($\text{RG}$). They develop an automatic GTA Evaluator to map free-form CoTs to actions and validate its reliability against human judgments across three mobile benchmarks (AITZ, CAGUI, AndroidControl). Extensive experiments show reasoning-execution gaps are common, with $\text{EG}$ dominating and only partial relief from parameter scaling, highlighting the need for grounding-focused improvements for trustworthy mobile-use agents. The framework enables finer diagnostics and supports designing more reliable, user-safe GUI agents in real-world deployments.
Abstract
Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on mobile graphical user interfaces. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight leaves potential reasoning-execution gaps unassessed, and such gaps foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or a crisis of trust. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but the reasoning process conflicts with the executed action. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
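To make the four-quadrant view concrete, the following is a minimal sketch of how per-step EM and GTA labels could be combined into EG and RG rates. It is an illustration under stated assumptions, not the authors' implementation: the `Step` type, the function names, and the quadrant labels `both_correct` and `both_wrong` are hypothetical, the GTA label is taken as a given boolean (in the paper it would come from the GTA Evaluator or a human judge), and normalizing each rate by the total number of steps is an assumption, since the abstract does not state the denominator.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One evaluated agent step (hypothetical container, not from the paper)."""
    em: bool   # Exact Match: the predicted action equals the ground-truth action
    gta: bool  # Ground-Truth Alignment: the action implied by the CoT equals the ground truth


def quadrant(step: Step) -> str:
    """Map a step to one of four quadrants defined by (GTA, EM)."""
    if step.gta and step.em:
        return "both_correct"  # reasoning and execution both match the ground truth
    if step.gta and not step.em:
        return "EG"            # Execution Gap: correct reasoning, failed execution
    if not step.gta and step.em:
        return "RG"            # Reasoning Gap: correct execution, conflicting reasoning
    return "both_wrong"        # neither reasoning nor execution matches


def gap_rates(steps: list[Step]) -> dict[str, float]:
    """Fraction of steps in each quadrant (assumed normalization: all steps)."""
    if not steps:
        raise ValueError("need at least one step")
    counts = Counter(quadrant(s) for s in steps)
    names = ("both_correct", "EG", "RG", "both_wrong")
    return {name: counts[name] / len(steps) for name in names}


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    demo = [
        Step(em=True, gta=True),
        Step(em=False, gta=True),
        Step(em=False, gta=True),
        Step(em=True, gta=False),
    ]
    print(gap_rates(demo))
```

Separating the quadrant assignment from the aggregation keeps the diagnostic reusable: the same `quadrant` labels could be grouped by benchmark or model size to compare where EG or RG concentrates.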
