BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan

TL;DR

The paper tackles the mismatch between current GUI agents and natural human-GUI interaction by introducing Blink-Think-Link (BTL), a brain-inspired framework that interleaves rapid visual localization (Blink), higher-level reasoning (Think), and precise motor actions (Link). It advances GUI agent learning with two innovations: Blink Data Generation, which creates ROI annotations for training, and BTL Reward, a Process-Outcome Integrated reward that combines format, blink, and link supervision within a reinforcement-learning loop optimized by Group Relative Policy Optimization (GRPO). The authors implement BTL-UI, built on Qwen2.5-VL backbones, and demonstrate state-of-the-art or competitive grounding and planning performance across multiple benchmarks (ScreenSpot, ScreenSpot-V2, ScreenSpot-Pro, AndroidControl, GUI-Odyssey) using a mix of grounding and planning data. The work suggests that integrating process-oriented signals with outcome-focused feedback yields more robust, generalizable GUI agents and lays groundwork for broader human-computer interaction applications.

Abstract

In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward -- the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates competitive performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI Agents.

Paper Structure

This paper contains 17 sections, 9 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Comparison of the previous Think-Answer framework and the proposed Blink-Think-Link framework in GUI tasks for RFT. Colored text is supervised by rule-based reinforcement learning, and each color indicates a different reward rule. The previous "Think-Answer" framework is optimized by a format reward, an action type reward, and a corresponding args reward. The Blink-Think-Link framework is optimized by a dual format reward, a blink reward, and a link reward.
  • Figure 2: Overall framework of BTL. Group Relative Policy Optimization (GRPO) is adopted to optimize the proposed BTL. First, the base model generates $N$ completions for a given GUI task sample. Next, GRPO computes the relative advantages within the group of completions, eliminating the need for manually annotated data. Finally, the policy model updates its parameters under the guidance of the relative advantages and the KL divergence constraint.
  • Figure 3: Two-stage data construction pipeline. In the first stage, a parsing model extracts the basic properties of the UI elements. In the second stage, to reduce redundancy in the number and attributes of elements, an analysis model simplifies the list to $\lambda$ elements with their positions (<bbox>), while the retained <caption> attribute indicates whether the element is interactive. In the example shown in the figure, the instruction for the current step is "Use the GPS to locate a nearby museum and then book a ride with Lyft." Accordingly, the most relevant element in the Blink output is the "Maps & Navigation" app with <ID>10</ID>.
  • Figure 4: Visualization of the interaction trajectory of the proposed BTL-UI on AndroidControl-High. The case was chosen at random and has ID 19477. The high-level instruction is 'Listen live to Radio GupShup 94.3 FM and search for other radio stations.' The tap icon in black is the prediction of BTL-UI, and the other is the ground truth.
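
The group-relative advantage step described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization, in which each completion's reward is centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example reward values are all assumptions made for illustration; they are not taken from the paper, and the BTL reward itself (format, blink, and link terms) is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of N completions sampled for one task.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps).
    `eps` (hypothetical) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for a group of N = 4 completions.
example_rewards = [1.0, 0.0, 0.0, 1.0]
example_advantages = group_relative_advantages(example_rewards)
```

Completions that score above the group mean receive positive advantages and those below receive negative ones, so the policy update depends only on how completions rank against each other within the group. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.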