GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

Tao Liu, Chongyu Wang, Rongjie Li, Yingchen Yu, Xuming He, Bai Song

TL;DR

GUI-Rise addresses the challenge of robust, long-horizon GUI navigation by integrating structured reasoning, action prediction, and compact history summarization within a multimodal LLM framework. It uses a two-stage training regime (supervised fine-tuning on pseudo-labeled traces, followed by reinforcement learning via Group Relative Policy Optimization (GRPO) with format, action, and history rewards) to align reasoning quality with execution accuracy. Across Mind2Web, AITW, GUIAct, and MiniWoB, GUI-Rise achieves state-of-the-art performance, particularly in out-of-domain and online settings, demonstrating strong generalization and stability for complex multi-step GUI tasks. The approach improves memory efficiency and interpretability through explicit CoT reasoning and dense yet compact history representations, which supports deployment in real-world, dynamic interfaces.

Abstract

While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
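
To make the GRPO step concrete, the following is a minimal sketch of how a group of sampled responses could be scored and turned into group-relative advantages. It is not the authors' implementation: the function names, the equal reward weights, and the use of the population standard deviation are all assumptions made for illustration. Only the three reward types (format, action, history) and the use of GRPO come from the text above.

```python
from statistics import mean, pstdev


def total_reward(format_r, action_r, history_r, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward terms into one scalar.

    The weights are hypothetical; the summary above does not state
    how the individual rewards are balanced.
    """
    w_f, w_a, w_h = weights
    return w_f * format_r + w_a * action_r + w_h * history_r


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    GRPO scores a response relative to the other responses sampled for
    the same prompt: subtract the group mean and divide by the group
    standard deviation (population std is assumed here).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one step: (format, action, history) rewards.
group = [(1.0, 1.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
rewards = [total_reward(*g) for g in group]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered on the group mean, a response is reinforced only if it outscores its siblings, and a group in which every response earns the same reward contributes no learning signal.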

Paper Structure

This paper contains 36 sections, 15 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: GUI-Rise agent framework overview. It introduces a three-subtask framework that integrates structured reasoning, action prediction, and history summarization. At each step, the agent performs structured reasoning (progress estimation and decision analysis), predicts the next GUI action, and updates a compact history summary for the next iteration.
  • Figure 2: Overview of the GUI-Rise training pipeline. The training consists of two stages: (1) supervised learning with pseudo-labeled summaries and ground truth action trajectories to initialize reasoning, and (2) reinforcement learning with rule-based and model-based rewards to improve decision-making and generalization.
  • Figure 3: GUI-Rise training process on the Mind2Web benchmark.
  • Figure 4: GUI-Rise training process on the AITW benchmark.
  • Figure 5: Impact of different history representations on GUI navigation in the Mind2Web benchmark.
  • ...and 2 more figures
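
The per-step loop described in the Figure 1 caption (reason, act, update the summary) can be sketched as follows. This is an illustrative outline only: `StepOutput`, `run_episode`, and `toy_policy` are hypothetical names, observations are plain strings rather than screenshots, and the toy policy stands in for the multimodal model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepOutput:
    reasoning: str  # progress estimation plus decision analysis
    action: str     # predicted GUI action for this step
    summary: str    # compact history passed to the next step


# A policy maps (task, observation, previous summary) to a StepOutput.
Policy = Callable[[str, str, str], StepOutput]


def run_episode(policy: Policy, task: str, observations: List[str],
                initial_summary: str = "") -> List[StepOutput]:
    """Run the three-subtask loop over a sequence of observations."""
    summary = initial_summary
    outputs = []
    for obs in observations:
        out = policy(task, obs, summary)
        outputs.append(out)
        # Only the compact summary is carried forward, not the full history.
        summary = out.summary
    return outputs


def toy_policy(task: str, observation: str, summary: str) -> StepOutput:
    """Stand-in for the model: clicks whatever element it is shown."""
    steps_done = len(summary.split("; ")) if summary else 0
    action = f"CLICK({observation})"
    new_summary = action if not summary else f"{summary}; {action}"
    reasoning = f"{steps_done} step(s) done; next target is {observation}"
    return StepOutput(reasoning, action, new_summary)


outputs = run_episode(toy_policy, "search for shoes", ["search_box", "submit"])
```

The design point the sketch isolates is that each step receives a single summary string instead of the full sequence of past screenshots and actions, which keeps the input size roughly constant as the episode grows.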