Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks

Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu

TL;DR

Resource-constrained MLLM GUI agents suffer from a gap between coordinate grounding and action-oriented reasoning. The authors propose query inference as a query-oriented pivot that infers user intents from coordinates and screenshots, formalizing the bridge between grounding ($\mathcal{G}$) and reasoning ($\mathcal{R}$), and enabling a three-step pipeline with $\mathcal{M}_{r}$ and $\mathcal{M}_{g}$ to produce data for reasoning SFT. Experiments show that query inference outperforms grounding at the same data scale and can match large-scale grounding baselines with under 0.1% of training data, with further gains when combined with CoT-based reasoning and additional semantic inputs. This approach offers a data-efficient path to robust GUI reasoning in resource-limited settings, with practical implications for personalized agents and on-device GUI navigation.

Abstract

Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves performance comparable to, or even better than, the large-scale grounding-enhanced OS-Atlas with less than 0.1% of the training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.

Paper Structure

This paper contains 29 sections, 6 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Illustration of grounding, query inference, and reasoning. Grounding identifies coordinates for the queries, while reasoning predicts the actions to achieve the goal. Query inference deduces the intended user queries for the action coordinates, serving as the pivot approach to smooth grounding and reasoning.
  • Figure 2: Three-step pipeline for constructing samples for query inference. First, we utilize proprietary MLLMs to refine low-level unintended queries into intended, formatted queries based on the corresponding coordinates and screenshots. Second, we utilize proprietary MLLMs for re-grounding based on the refined queries. Finally, we analyze the accuracy of the predicted coordinates to decide whether to save the sample.
  • Figure 3: The overall action prediction performance on AITZ when trained with grounding, query inference as the alternative task, and query inference as the pivot task across various data scales.
  • Figure 4: Example triplets $\langle s, q_r, c \rangle$ from the refined UIBERT dataset, along with the original query $q$.
  • Figure 5: The prompt template for query refinement.
  • ...and 5 more figures
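
The three-step pipeline described in the Figure 2 caption (refine the query, re-ground it, keep the sample only if the re-grounded coordinates are accurate) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the `refine` and `reground` callables (stand-ins for the proprietary MLLM calls), and the "predicted point falls inside the ground-truth box" acceptance rule are all assumptions introduced here; the source only states that the accuracy of the predicted coordinates decides whether a sample is saved.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class Sample:
    """Hypothetical container for one raw grounding sample."""
    screenshot: str  # path or identifier of the screenshot s
    query: str       # original low-level query q
    bbox: BBox       # ground-truth element box giving the coordinates c


def point_in_bbox(point: Point, bbox: BBox) -> bool:
    """Assumed acceptance rule: the predicted point lies inside the box."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def build_query_inference_data(
    samples: Iterable[Sample],
    refine: Callable[[str, str, BBox], str],  # stand-in for the refinement MLLM
    reground: Callable[[str, str], Point],    # stand-in for the re-grounding MLLM
) -> List[Tuple[str, str, BBox]]:
    """Return triplets (screenshot, refined query, coordinates) that pass the check."""
    kept = []
    for sample in samples:
        # Step 1: refine the original query using the screenshot and coordinates.
        refined = refine(sample.screenshot, sample.query, sample.bbox)
        # Step 2: re-ground the refined query on the same screenshot.
        predicted = reground(sample.screenshot, refined)
        # Step 3: keep the sample only if the re-grounded point is accurate.
        if point_in_bbox(predicted, sample.bbox):
            kept.append((sample.screenshot, refined, sample.bbox))
    return kept
```

In this sketch the re-grounding step acts as a consistency filter: a refined query is kept only when a model can map it back to the element it was written for, which is one plausible reading of the filtering step in the caption.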