BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
TL;DR
The paper tackles the mismatch between current GUI agents and natural human-GUI interaction by introducing Blink-Think-Link (BTL), a brain-inspired framework that interleaves rapid visual localization (Blink), higher-level reasoning (Think), and precise motor actions (Link). It advances GUI agent learning with two innovations: Blink Data Generation, which creates ROI annotations for training, and BTL Reward, a Process-Outcome Integrated reward that combines format, blink, and link supervision within a reinforcement-learning loop optimized by Group Relative Policy Optimization (GRPO). The authors implement BTL-UI, built on Qwen2.5-VL backbones, and demonstrate state-of-the-art or competitive grounding and planning performance across multiple benchmarks (ScreenSpot, ScreenSpot-V2, ScreenSpot-Pro, AndroidControl, GUI-Odyssey) using a mix of grounding and planning data. The work suggests that integrating process-oriented signals with outcome-focused feedback yields more robust, generalizable GUI agents and lays groundwork for broader human-computer interaction applications.
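As an illustration of the group-relative step in GRPO mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes each prompt yields a group of sampled responses with scalar rewards, and that advantages are obtained by normalizing rewards within the group. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (hypothetical helper).

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` avoids division by
    zero when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled responses for one prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that score above the group mean receive positive advantages and the rest receive negative ones, so no separate learned value model is needed to form the baseline.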
Abstract
In the field of AI-driven human-GUI interaction automation, rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, yet a fundamental challenge persists: the interaction logic of current GUI agents deviates significantly from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the cognitive process humans follow when operating graphical interfaces. The framework decomposes interactions into three biologically plausible phases: (1) Blink: rapid detection of and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think: higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link: generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation, an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward, the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates competitive performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI agents.
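To make the idea of a rule-based reward driven by both process and outcome more concrete, here is a minimal sketch under stated assumptions; it is not the reward defined in the paper. It assumes the model output wraps the three phases in `<blink>`, `<think>`, and `<link>` tags, that the blink term compares a predicted region of interest with an annotated one via IoU, and that the link term checks the action type and whether a predicted point falls inside a target box. The tag names, the 0.5 threshold, the equal weighting, and all function names are hypothetical.

```python
import re

# Hypothetical output layout: <blink>...</blink><think>...</think><link>...</link>
_FORMAT = re.compile(
    r"\s*<blink>.*?</blink>\s*<think>.*?</think>\s*<link>.*?</link>\s*",
    flags=re.DOTALL,
)


def format_reward(text):
    """1.0 if the response follows the assumed three-tag layout, else 0.0."""
    return 1.0 if _FORMAT.fullmatch(text) else 0.0


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def blink_reward(pred_roi, gt_roi, threshold=0.5):
    """Process signal: did the predicted ROI overlap the annotated ROI enough?"""
    return 1.0 if iou(pred_roi, gt_roi) >= threshold else 0.0


def link_reward(pred_type, pred_point, gt_type, gt_box):
    """Outcome signal: correct action type and point inside the target box."""
    if pred_type != gt_type:
        return 0.0
    x, y = pred_point
    inside = gt_box[0] <= x <= gt_box[2] and gt_box[1] <= y <= gt_box[3]
    return 1.0 if inside else 0.0


def btl_reward(text, pred_roi, gt_roi, pred_type, pred_point, gt_type, gt_box):
    """Sum of format, blink, and link terms (equal weights are an assumption)."""
    return (
        format_reward(text)
        + blink_reward(pred_roi, gt_roi)
        + link_reward(pred_type, pred_point, gt_type, gt_box)
    )
```

In this sketch the blink term rewards looking at the right region even when the final action is wrong, while the link term rewards only the executed outcome; combining the two is what makes the signal both process- and outcome-driven.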
