Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning

Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu

TL;DR

This work introduces TAG, a Tuning-free Attention-driven Grounding framework that exploits the inherent attention patterns of a pretrained Multimodal LLM (MiniCPM-Llama3-V 2.5) to ground GUI elements. TAG combines adaptive text token selection with attention-driven grounding and selective head filtering to map user queries to GUI components via cross- and self-attention maps, achieving strong text localization and competitive GUI grounding across multiple benchmarks. Across the OCG, ScreenSpot, and Mind2Web evaluations, TAG consistently outperforms or matches tuning-based methods while avoiding costly fine-tuning and generalizing better to varied aspect ratios and platforms. The approach demonstrates the untapped potential of pretrained attention mechanisms for scalable GUI automation and suggests broad applicability to other multimodal grounding tasks.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding: accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. Additionally, we demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5, highlighting the potential of using attention maps from pretrained MLLMs and paving the way for future innovations in this domain.
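
To make the idea of aggregating attention maps from selected tokens concrete, the following is a minimal NumPy sketch on toy data, not the authors' implementation. The function name `ground_from_attention`, the tensor layout (heads, text tokens, image patches), the peak-attention head score, and the `top_k` parameter are all assumptions introduced for illustration; the paper's actual token selection and head filtering rules are not reproduced here.

```python
import numpy as np

def ground_from_attention(attn, token_ids, grid_hw, top_k=2):
    """Illustrative only: locate the image patch most attended by selected tokens.

    attn      : array of shape (heads, text_tokens, image_patches)
    token_ids : indices of the query tokens whose attention is aggregated
    grid_hw   : (rows, cols) of the image patch grid
    top_k     : number of heads to keep (hypothetical filtering rule)
    """
    rows, cols = grid_hw
    # Average the attention of the selected tokens within each head.
    per_head = attn[:, token_ids, :].mean(axis=1)      # (heads, patches)
    # Assumed head score: how sharply a head peaks on a single patch.
    scores = per_head.max(axis=1)                      # (heads,)
    keep = np.argsort(scores)[-top_k:]                 # indices of top-k heads
    # Aggregate the kept heads into one map over the patch grid.
    heat = per_head[keep].mean(axis=0).reshape(rows, cols)
    r, c = np.unravel_index(np.argmax(heat), heat.shape)
    return (int(r), int(c)), heat

# Toy input: 3 heads, 4 text tokens, a 2x3 patch grid (6 patches).
attn = np.full((3, 4, 6), 1.0 / 6.0)                   # diffuse by default
attn[0, 1:3, :] = [0.02, 0.02, 0.02, 0.02, 0.90, 0.02] # head 0 peaks on patch 4
attn[1, 1:3, :] = [0.05, 0.05, 0.05, 0.05, 0.75, 0.05] # head 1 peaks on patch 4

point, heat = ground_from_attention(attn, token_ids=[1, 2], grid_hw=(2, 3))
print(point)
```

In this toy setup the diffuse third head is filtered out and the two peaked heads agree on patch 4, which is row 1, column 1 of the 2x3 grid. With a real MLLM, the attention tensor would come from the model's own layers and the patch coordinate would still need to be mapped back to pixel coordinates.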

Paper Structure

This paper contains 41 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Illustration of MiniCPMV2.5's strong GUI image understanding but poor element localization. Our attention-driven GUI grounding leverages its inherent attention to enhance localization accuracy without fine-tuning, as shown on the right.
  • Figure 2: Overall pipeline of our TAG approach in Sec. \ref{subsec:tag_pipeline} (top) and the self-attention selection module in Sec. \ref{subsec:attn_filtering} (bottom).
  • Figure 3: Demonstrating how choosing top self-attention heads improves text-to-image token mapping (see Sec. \ref{subsec:attn_filtering} for details).
  • Figure 4: Comparison of methods on two ScreenSpot cases. Our attention-driven grounding with element descriptions succeeds in localizing the text and icon elements, respectively. Please zoom in for a better view.
  • Figure 5: Demonstration of our method on Mind2Web, grounding precisely at each step and successfully achieving the overall goal. The detailed action history is presented in the supplementary materials. Please zoom in for a better view.
  • ...and 5 more figures