Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
Longhui Ma, Di Zhao, Siwei Wang, Zhao Lv, Miao Wang
TL;DR
Trifuse addresses the heavy data requirements of GUI grounding, the task of mapping natural language instructions to GUI elements, by fusing three complementary modalities at inference time: MLLM attention, OCR-derived text cues, and icon-caption semantics. Its Consensus-SinglePeak (CS) fusion strategy enforces cross-modal agreement while preserving modality-specific discriminative peaks, and a two-stage localization scheme improves spatial precision on high-resolution screens without any GUI-specific fine-tuning. Experiments on four grounding benchmarks show that Trifuse closes much of the gap to supervised fine-tuning and RL approaches while requiring no task-specific annotated data. Ablations confirm the value of the OCR and caption cues, of token/head filtering, and of the CS fusion design. The framework is data-efficient and modular, generalizes well across platforms and layouts, and points to a practical path toward scalable GUI agents that do not depend on large GUI-grounding datasets.
Abstract
GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) on large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLM attention mechanisms without task-specific fine-tuning, but suffer from low reliability because GUI images lack explicit and complementary spatial anchors. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates such anchors. Trifuse combines MLLM attention with OCR-derived textual cues and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting the effectiveness of Trifuse as a general framework for GUI grounding.
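
The abstract does not give the exact form of the CS fusion, so the following is only an illustrative sketch of how a consensus term and a single-peak term could be combined over three per-pixel relevance maps. Everything in it is an assumption made for illustration: the function names (`normalize`, `fuse_heatmaps`, `predict_point`), the geometric mean as the consensus term, the element-wise maximum as the single-peak term, and the mixing weight `alpha`. It should not be read as the authors' formulation.

```python
import numpy as np


def normalize(heatmap):
    """Rescale a heatmap to [0, 1]; a constant map becomes all zeros."""
    h = np.asarray(heatmap, dtype=float)
    h = h - h.min()
    peak = h.max()
    return h / peak if peak > 0 else np.zeros_like(h)


def fuse_heatmaps(attention, ocr, caption, alpha=0.5):
    """Hypothetical consensus + single-peak fusion (not the paper's formula).

    consensus:   geometric mean, high only where all three maps agree
    single_peak: element-wise max, keeps a sharp peak from any one map
    alpha:       assumed mixing weight between the two terms
    """
    stack = np.stack([normalize(m) for m in (attention, ocr, caption)])
    consensus = np.prod(stack, axis=0) ** (1.0 / stack.shape[0])
    single_peak = stack.max(axis=0)
    return alpha * consensus + (1.0 - alpha) * single_peak


def predict_point(fused):
    """Return the (row, col) index of the highest fused score."""
    row, col = np.unravel_index(np.argmax(fused), fused.shape)
    return int(row), int(col)
```

In this sketch, a location supported by all three maps outscores a location that is strong in only one of them, while an isolated peak still contributes through the maximum term rather than being zeroed out.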
