Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion

Longhui Ma, Di Zhao, Siwei Wang, Zhao Lv, Miao Wang

TL;DR

Trifuse tackles the data-hungry problem of grounding natural language instructions to GUI elements by fusing three complementary modalities at inference: MLLM attention, OCR-derived text cues, and icon-caption semantics. Through a Consensus-SinglePeak fusion strategy, it jointly enforces cross-modal agreement while preserving modality-specific discriminative peaks, and a two-stage localization scheme enhances spatial precision on high-resolution screens without any GUI-specific fine-tuning. Extensive experiments across four grounding benchmarks show that Trifuse closes much of the gap to supervised fine-tuning and RL approaches while requiring no task-specific annotated data, with ablations confirming the value of OCR and caption cues and the effectiveness of token/head filtering and the CS fusion design. This data-efficient, modular framework demonstrates strong generalization across platforms and layouts, highlighting a practical path toward scalable GUI agents that do not depend on large GUI-grounding datasets.

Abstract

GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) on large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs' attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.
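
To make the fusion idea concrete, here is a minimal sketch of combining three modality heatmaps so that cross-modal agreement is rewarded while a strong peak from any single modality is retained. It is not the paper's CS formula, which the abstract does not give: the geometric-mean consensus term, the per-cell maximum as the peak term, the blending weight `alpha`, and the names `normalize`, `fuse`, and `argmax_cell` are all assumptions made for illustration.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def normalize(h: Heatmap) -> Heatmap:
    """Min-max scale a heatmap to [0, 1]; a flat map becomes all zeros."""
    flat = [v for row in h for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in h]
    return [[(v - lo) / (hi - lo) for v in row] for row in h]


def fuse(maps: List[Heatmap], alpha: float = 0.5) -> Heatmap:
    """Blend a consensus term with a single-peak term (illustrative only).

    consensus: geometric mean across modalities, high only where all agree.
    peak:      per-cell maximum, keeps a sharp response from one modality.
    alpha:     hypothetical weight between the two terms.
    """
    normed = [normalize(m) for m in maps]
    rows, cols = len(normed[0]), len(normed[0][0])
    fused: Heatmap = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            vals = [m[r][c] for m in normed]
            prod = 1.0
            for v in vals:
                prod *= v
            consensus = prod ** (1.0 / len(vals))
            peak = max(vals)
            out_row.append(alpha * consensus + (1.0 - alpha) * peak)
        fused.append(out_row)
    return fused


def argmax_cell(h: Heatmap) -> Tuple[int, int]:
    """Return (row, col) of the first cell holding the largest value."""
    best_r, best_c, best_v = 0, 0, h[0][0]
    for r, row in enumerate(h):
        for c, v in enumerate(row):
            if v > best_v:
                best_r, best_c, best_v = r, c, v
    return best_r, best_c


# Toy 2x2 heatmaps standing in for the three modalities (made-up values).
attention = [[0.1, 0.6], [0.2, 0.9]]
ocr = [[0.0, 0.0], [0.0, 1.0]]
caption = [[0.0, 0.3], [0.1, 0.8]]

fused = fuse([attention, ocr, caption])
target = argmax_cell(fused)  # the cell where all three modalities agree
```

In this toy input the bottom-right cell is the maximum of every modality, so it wins under both terms. A cell that only attention favors keeps part of its score through the peak term but loses the consensus contribution.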

Paper Structure (38 sections, 13 equations, 9 figures, 11 tables)

This paper contains 38 sections, 13 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Comparison between training-based and attention-based methods for GUI grounding.
  • Figure 2: Overview of our Trifuse framework. Trifuse consists of three main components: (1) a modality extraction module that derives complementary grounding cues, including attention-based signals from MLLMs, textual cues from OCR, and icon-level visual semantics from captioning; (2) a Consensus-SinglePeak (CS) fusion module that integrates these modality-specific heatmaps by jointly modeling cross-modal agreement and modality-specific discriminative peaks; and (3) a two-stage localization module that progressively refines the fused grounding map through cropping and zoom-in operations to accurately identify the target GUI element.
  • Figure 3: Ablation studies of top-$k$ selection strategies on attention modality.
  • Figure 4: System prompt used for ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro evaluation.
  • Figure 5: System prompt used for OSWorld-G evaluation.
  • ...and 4 more figures
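
The two-stage localization mentioned in the Figure 2 caption (crop around a coarse estimate, zoom in, then refine) comes down to coordinate bookkeeping, sketched below. The crop size, zoom factor, and the function names `crop_box` and `to_screen` are hypothetical. The listing gives no concrete values, and the refinement step itself, re-running grounding on the zoomed crop, is not shown.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in screen pixels


def crop_box(cx: int, cy: int, crop_w: int, crop_h: int,
             screen_w: int, screen_h: int) -> Box:
    """Centre a crop on the coarse peak (cx, cy), shifted to stay on screen."""
    crop_w, crop_h = min(crop_w, screen_w), min(crop_h, screen_h)
    left = min(max(cx - crop_w // 2, 0), screen_w - crop_w)
    top = min(max(cy - crop_h // 2, 0), screen_h - crop_h)
    return left, top, left + crop_w, top + crop_h


def to_screen(px: float, py: float, box: Box, zoom: float) -> Tuple[float, float]:
    """Map a point found in the zoomed crop back to full-screen pixels."""
    left, top, _, _ = box
    return left + px / zoom, top + py / zoom


# Stage 1: a coarse peak near the top-right corner of a 3840x2160 screen.
box = crop_box(3700, 100, 960, 540, 3840, 2160)

# Stage 2: suppose refinement on the 2x-zoomed crop returns (1640, 200).
refined = to_screen(1640, 200, box, zoom=2.0)
```

Clamping the crop to the screen bounds matters for targets near an edge, which is common for toolbar and tray icons on high-resolution screens.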