MVP: Multiple View Prediction Improves GUI Grounding

Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

TL;DR

This work tackles the instability of GUI grounding under visual perturbations by introducing MVP, a training-free framework that aggregates predictions from multiple carefully cropped views. MVP combines Attention-Guided View Proposal to generate diverse, informative crops and Multi-Coordinate Clustering to identify spatially consistent predictions, outputting the centroid of the largest agreement region. Across diverse LVLM-based grounding models and benchmarks (ScreenSpot-Pro, OS-World-G, UI-Vision), MVP yields consistent, substantial improvements, including state-of-the-art gains on several closed- and open-source models. The approach requires no retraining and demonstrates strong generalization to high-resolution screens and small UI elements, suggesting wide applicability for robust GUI grounding in practical agents.

Abstract

GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability: minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for samples with high-resolution screenshots and small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at https://github.com/ZJUSCL/MVP.
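
The aggregation idea in the abstract (keep the centroid of the densest spatial cluster, discard outliers) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `cluster_predictions` and `aggregate`, the distance threshold `radius`, and the choice of connected-component grouping are all assumptions made for illustration. The released code may use a different clustering rule.

```python
from math import hypot


def cluster_predictions(points, radius):
    """Group points into clusters; two points are linked when they lie
    within `radius` pixels of each other (hypothetical linking rule)."""
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            (xi, yi), (xj, yj) = points[i], points[j]
            if hypot(xi - xj, yi - yj) <= radius:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())


def aggregate(points, radius=30.0):
    """Return the centroid of the largest cluster of predicted coordinates.

    `points` are (x, y) predictions already mapped back to full-screenshot
    pixels. `radius` is an assumed threshold, not a value from the paper.
    """
    if not points:
        raise ValueError("need at least one prediction")
    largest = max(cluster_predictions(points, radius), key=len)
    cx = sum(x for x, _ in largest) / len(largest)
    cy = sum(y for _, y in largest) / len(largest)
    return cx, cy


# Three views agree near (100, 100); one view produces an outlier.
preds = [(100, 100), (104, 98), (96, 102), (800, 400)]
print(aggregate(preds))
```

In this toy input the outlier at (800, 400) forms its own single-point cluster, so it has no influence on the returned coordinate. When two clusters tie in size, this sketch simply keeps the first one found; a real system would need an explicit tie-breaking rule.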

Paper Structure

This paper contains 29 sections, 8 equations, 8 figures, 9 tables, 2 algorithms.

Figures (8)

  • Figure 1: (a) An example of the model's prediction instability from ScreenSpot-Pro. The instruction is "save image in a specific format". Slightly shifting the screenshot produces significantly different predicted coordinates. (b) We crop different views from the original ScreenSpot-Pro screenshots and run inference on each view separately using GTA1-7B. The pass@N accuracy improves as the number of views increases, indicating that the model is capable of producing the correct prediction. (c) MVP significantly improves the performance of grounding models of different architectures and sizes by aggregating the results from different views.
  • Figure 2: We evaluate model instability by adding a 28-pixel border to ScreenSpot-Pro images and performing separate inference runs with GTA1-7B. (a) This minor visual perturbation causes 7.3% of originally correct predictions to become incorrect, and 7.8% of originally wrong predictions to become correct, revealing high sensitivity to input variations. (b) When analyzing the distance between the two predicted coordinates grouped by image resolution, we observe that instability increases significantly with higher resolutions. (c) Similarly, when grouping by the area of the target region, we find that instability is more pronounced for smaller UI elements.
  • Figure 3: Overview of our Multiple View Prediction (MVP) pipeline, which consists of two main stages: Attention-Guided View Proposal and Multi-Coordinate Clustering. First, MVP takes the user instruction and screenshot and forwards them through the language model to derive attention scores from the instruction to each visual token. Then the top-k scoring tokens are selected, and an h×w sub-region is cropped around the center of each corresponding visual patch. These sub-regions are ranked by the number of top-k tokens they contain. The top-m regions are chosen and enlarged to form the final set of views. The model independently predicts coordinates for each view. Finally, MVP aggregates all the predictions by clustering the coordinates based on spatial proximity and outputs the center of the largest cluster as the final prediction.
  • Figure 4: We evaluate GTA1-7B on ScreenSpot-Pro with different numbers of views. Increasing the number of views does not consistently lead to performance improvements. We suggest 4 views as the optimal configuration.
  • Figure 5: Example of an annotated image. For each sample, we annotate 2-4 visible red dots, each with a corresponding numerical label. The model is trained to directly output the correct label.
  • ...and 3 more figures
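
The view-proposal stage described in the Figure 3 caption (select the top-k attended visual tokens, crop an h×w region around each, rank regions by how many top-k tokens they contain, keep the top-m) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: `propose_views`, `to_screen`, the square `patch` size, the boundary clamping, and the `scale` enlargement factor are all hypothetical, and the attention scores are assumed to be supplied as a 2D grid with one value per visual patch.

```python
def propose_views(scores, patch, img_w, img_h, crop_w, crop_h, k, m):
    """Return up to m crop boxes (x0, y0, x1, y1) in screenshot pixels.

    scores: 2D list, scores[r][c] = attention from the instruction to the
            visual patch in row r, column c (assumed input format).
    patch:  side length in pixels of one visual patch (assumed square).
    """
    rows, cols = len(scores), len(scores[0])
    flat = [(scores[r][c], r, c) for r in range(rows) for c in range(cols)]
    top = sorted(flat, reverse=True)[:k]  # top-k scoring tokens
    centers = [((c + 0.5) * patch, (r + 0.5) * patch) for _, r, c in top]

    boxes = []
    for cx, cy in centers:
        # Center the crop on the patch, then clamp it inside the image.
        x0 = min(max(cx - crop_w / 2, 0), max(img_w - crop_w, 0))
        y0 = min(max(cy - crop_h / 2, 0), max(img_h - crop_h, 0))
        box = (x0, y0, min(x0 + crop_w, img_w), min(y0 + crop_h, img_h))
        if box not in boxes:  # skip exact duplicates
            boxes.append(box)

    def covered(box):
        # Number of top-k token centers that fall inside this crop.
        x0, y0, x1, y1 = box
        return sum(x0 <= x <= x1 and y0 <= y <= y1 for x, y in centers)

    # The sort is stable, so ties keep the order of descending attention.
    return sorted(boxes, key=covered, reverse=True)[:m]


def to_screen(pred, box, scale):
    """Map a prediction made on a view enlarged by `scale` back to
    full-screenshot coordinates."""
    px, py = pred
    return box[0] + px / scale, box[1] + py / scale


# Toy 4x4 attention grid with 10-pixel patches on a 40x40 screenshot.
grid = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
]
views = propose_views(grid, patch=10, img_w=40, img_h=40,
                      crop_w=20, crop_h=20, k=3, m=2)
print(views)
```

Each selected view would then be enlarged and passed to the grounding model on its own, and `to_screen` converts each view-local prediction back to screenshot pixels so the predictions can be clustered in a shared coordinate frame.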