Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents

Lingzhong Dong, Ziqi Zhou, Shuaibo Yang, Haiyue Sheng, Pengzhou Cheng, Zongru Wu, Zheng Wu, Gongshen Liu, Zhuosheng Zhang

TL;DR

This work introduces Ground-Truth Alignment (GTA), a principled metric to diagnose whether a vision-language model-powered mobile-use agent's chain-of-thought (CoT) truly implies the ground-truth action. By combining GTA with Exact Match (EM), the authors establish a four-quadrant framework to separate reasoning accuracy from execution accuracy and reveal two key failure modes: Execution Gap ($\mathrm{EG}$) and Reasoning Gap ($\mathrm{RG}$). They develop an automatic GTA Evaluator to map free-form CoTs to actions and validate its reliability against human judgments across three mobile benchmarks (AITZ, CAGUI, AndroidControl). Extensive experiments show reasoning-execution gaps are common, with $\mathrm{EG}$ dominating and only partial relief from parameter scaling, highlighting the need for grounding-focused improvements for trustworthy mobile-use agents. The framework enables finer diagnostics and supports designing more reliable, user-safe GUI agents in real-world deployments.

Abstract

Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions on mobile graphical user interfaces. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether the CoT reasoning aligns with the ground-truth actions. This oversight leaves potential reasoning-execution gaps unassessed, which in turn fosters over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or a crisis of trust. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but the reasoning process conflicts with the executed action. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
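As a rough illustration of how GTA and EM combine, the sketch below assigns each evaluated step to one of four outcomes and aggregates them into rates. It assumes each step has already been reduced to two booleans (whether the CoT-implied action matches the ground truth, and whether the executed action matches the ground truth). The function names `quadrant` and `gap_rates` and the label strings are hypothetical and are not taken from the paper's code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

LABELS = ("IDEAL", "EG", "RG", "BOTH_WRONG")


def quadrant(gta_ok: bool, em_ok: bool) -> str:
    """Classify one step from its two correctness flags (hypothetical helper)."""
    if gta_ok and em_ok:
        return "IDEAL"       # reasoning and execution both correct
    if gta_ok:
        return "EG"          # execution gap: correct reasoning, wrong action
    if em_ok:
        return "RG"          # reasoning gap: correct action, wrong reasoning
    return "BOTH_WRONG"      # neither reasoning nor action is correct


def gap_rates(steps: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return the fraction of steps in each outcome, plus overall GTA and EM."""
    steps = list(steps)
    if not steps:
        raise ValueError("need at least one evaluated step")
    counts = Counter(quadrant(gta_ok, em_ok) for gta_ok, em_ok in steps)
    rates = {label: counts[label] / len(steps) for label in LABELS}
    # Reasoning accuracy counts every step whose CoT implies the right action;
    # execution accuracy counts every step whose executed action is right.
    rates["GTA"] = rates["IDEAL"] + rates["EG"]
    rates["EM"] = rates["IDEAL"] + rates["RG"]
    return rates


if __name__ == "__main__":
    # Toy data, not results from the paper: (gta_ok, em_ok) per step.
    toy_steps = (
        [(True, True)] * 4 + [(True, False)] * 2 + [(False, True)] + [(False, False)]
    )
    for name, value in gap_rates(toy_steps).items():
        print(f"{name}: {value:.3f}")
```

In this toy input, the execution-gap rate exceeds the reasoning-gap rate, which is the same qualitative pattern the abstract reports, though the numbers here are invented purely for illustration.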

Paper Structure

This paper contains 37 sections, 14 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Four-quadrant diagnostic framework of reasoning–execution gaps. The axes represent reasoning accuracy (GTA) and action accuracy (EM). Q1: Ideal, where both reasoning and action are correct; Q2: Execution Gap (EG), where reasoning is correct but execution fails; Q3: Both Wrong, where both reasoning and action are incorrect; Q4: Reasoning Gap (RG), where the action is correct but reasoning fails.
  • Figure 2: Action distributions of the original datasets and the stratified sampled subset. The sampling procedure preserves the overall distribution of actions while ensuring that representative minority cases are adequately covered. Left shows the full dataset distributions, while right illustrates the 1,800 sampled instances used for human annotation and agreement analysis.
  • Figure 3: Radar plots show the GTA evaluator accuracy across three models and datasets. Overall, the evaluator achieves consistently high accuracy, with similar performance across models. Accuracy peaks on AndroidControl, while results on CAGUI and AITZ are slightly lower.
  • Figure 4: Spline plots of $\mathrm{GTA}$, $\mathrm{EM}$, and $\mathrm{IDEAL}$. $\mathrm{EM}$ measures execution accuracy, $\mathrm{GTA}$ reflects reasoning accuracy, and $\mathrm{IDEAL}$ means ideal reasoning and execution. By construction, $\mathrm{GTA}-\mathrm{IDEAL}=\mathrm{EG}$ and $\mathrm{EM}-\mathrm{IDEAL}=\mathrm{RG}$. When $\mathrm{GTA}$ lies above $\mathrm{EM}$, it indicates $\mathrm{EG}>\mathrm{RG}$, revealing that the main bottleneck lies in translating correct reasoning into executable actions.
  • Figure 5: Effect of parameter scaling on reasoning–execution gaps in AndroidControl. (a) Positive metrics: EM and GTA, where higher is better. (b) Negative metrics: EG and RG, where lower is better. Orange points denote DPO models and blue points denote SFT models, with point size proportional to parameter scale. Scaling consistently improves EM and GTA while reducing EG and RG, though even the largest (72B) model still exhibits execution gaps above 10%.
  • ...and 4 more figures
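
The Figure 4 caption states that $\mathrm{GTA}$ lying above $\mathrm{EM}$ indicates $\mathrm{EG}>\mathrm{RG}$. A short derivation makes that step explicit, using only the two identities given in the caption and assuming all quantities are rates over the same set of evaluated steps:

$$
\begin{aligned}
\mathrm{GTA} &= \mathrm{IDEAL} + \mathrm{EG}, \\
\mathrm{EM} &= \mathrm{IDEAL} + \mathrm{RG}, \\
\mathrm{GTA} - \mathrm{EM} &= \mathrm{EG} - \mathrm{RG}.
\end{aligned}
$$

The $\mathrm{IDEAL}$ term cancels, so $\mathrm{GTA} > \mathrm{EM}$ holds exactly when $\mathrm{EG} > \mathrm{RG}$, and the vertical distance between the two curves equals the difference between the two gap rates.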