Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements

Ziwei Liu, Tao Feng, Borui Kang, Yanbing Yang, Jun Luo

Abstract

Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents are developing rapidly, and visual grounding, which maps natural language instructions to target UI elements, is their core capability. Existing GUI agents typically fine-tune MLLMs on massive datasets to handle the challenges of understanding instructions and interfaces, which not only incurs high data annotation costs but also makes performance dependent on data quality and distribution. To avoid such cumbersome yet ineffective training, we observe that complex interfaces can be decomposed into basic visual elements that common MLLMs can understand directly. Consequently, we propose ZoomUI, which leverages inference scaling to guide common MLLMs in progressively anchoring instruction elements to increasingly detailed interface elements. Specifically, ZoomUI first optimizes the latent thinking to transform the original instruction into a description of the element's visual features, and subsequently leverages internal attention to iteratively zoom in on the interface region of the target element. Evaluations on extensive benchmarks demonstrate that ZoomUI matches or even surpasses state-of-the-art (SOTA) baselines.

Paper Structure (30 sections, 10 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Performance comparisons of ZoomUI and other SOTA baseline methods.
  • Figure 1: Parameter sensitivity of instruction refinement across four benchmarks.
  • Figure 2: Overview Workflow. ZoomUI begins by refining the original instruction into a description of the element's visual features. Subsequently, by capturing the attention distribution during the coordinate generation phase, it iteratively zooms into relevant regions to obtain a more fine-grained interface of the target UI element.
  • Figure 2: Visualization comparison. (a) is a UI with the instruction "Check Stats of NFL teams"; the red block marks the ground-truth region. The yellow blocks in the attention maps of (b) and (c) are the zoom-in regions, which correspond to the highest attention scores.
  • Figure 3: Instruction Refinement. We introduce learnable thought vectors that are injected into the interface and prompt embeddings. These vectors are iteratively optimized by maximizing the likelihood of the output logits via gradient ascent, which steers the latent representations toward more reliable refined instructions.
  • ...and 5 more figures
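
The attention-guided zoom described in the Figure 2 captions can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`zoom_once`, `iterative_zoom`, `toy_attention`), the fixed shrink factor, the number of steps, and the rule of centering the next crop on the single highest-attention cell are all assumptions made for illustration. The captions state only that zoom-in regions correspond to the highest attention scores. In a real system, `attention_fn` would be replaced by attention extracted from an MLLM during coordinate generation; here it is a synthetic Gaussian bump.

```python
import numpy as np


def zoom_once(attn, box, shrink=0.5):
    """Return a smaller box centered (where possible) on the attention peak.

    attn   : 2D array of attention scores over a grid covering `box`.
    box    : (x0, y0, x1, y1) in original-image pixel coordinates.
    shrink : assumed ratio of new box side length to old side length.
    """
    x0, y0, x1, y1 = box
    rows, cols = attn.shape
    r, c = np.unravel_index(np.argmax(attn), attn.shape)
    # Center of the peak cell, mapped back to original-image coordinates.
    cx = x0 + (c + 0.5) * (x1 - x0) / cols
    cy = y0 + (r + 0.5) * (y1 - y0) / rows
    w = (x1 - x0) * shrink
    h = (y1 - y0) * shrink
    # Clamp so the new box stays inside the current one.
    nx0 = min(max(cx - w / 2, x0), x1 - w)
    ny0 = min(max(cy - h / 2, y0), y1 - h)
    return (nx0, ny0, nx0 + w, ny0 + h)


def iterative_zoom(attention_fn, image_size, steps=3, shrink=0.5):
    """Repeatedly crop toward the region with the highest attention."""
    width, height = image_size
    box = (0.0, 0.0, float(width), float(height))
    for _ in range(steps):
        attn = attention_fn(box)  # hypothetical hook into the model
        box = zoom_once(attn, box, shrink)
    return box


def toy_attention(box, target=(700.0, 230.0), grid=8, sigma=100.0):
    """Synthetic stand-in for model attention: a Gaussian bump at `target`."""
    x0, y0, x1, y1 = box
    xs = x0 + (np.arange(grid) + 0.5) * (x1 - x0) / grid
    ys = y0 + (np.arange(grid) + 0.5) * (y1 - y0) / grid
    gx, gy = np.meshgrid(xs, ys)
    dist2 = (gx - target[0]) ** 2 + (gy - target[1]) ** 2
    return np.exp(-dist2 / (2 * sigma ** 2))


final_box = iterative_zoom(toy_attention, image_size=(1000, 800), steps=3)
print(final_box)
```

With these toy settings, each step halves the side lengths of the crop, so three steps reduce a 1000×800 screenshot to a 125×100 region around the synthetic target. A fixed shrink factor keeps the sketch simple; a practical variant might instead choose the crop size from the spread of the attention distribution.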