CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

Zhenquan Yao, Zitong Huang, Yihan Zeng, Jianhua Han, Hang Xu, Chun-Mei Feng, Jianwei Ma, Wangmeng Zuo

TL;DR

This work introduces an entropy-guided SFT proportion adjustment mechanism that dynamically controls the weight allocation between the SFT and RL training phases. It also establishes the AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to simulate and evaluate continual GUI learning.

Abstract

Graphical User Interface (GUI) agents have advanced significantly thanks to recent progress in multimodal large language models (MLLMs). However, because GUI applications are updated frequently, adapting to new tasks without forgetting old ones remains an open problem in continual GUI learning. In this work, we reveal that while Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting, whereas Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. Based on this insight, we propose a Continual GUI Learning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. To resolve explicit gradient interference, we further develop a specialized gradient surgery strategy: by projecting exploratory SFT gradients onto GRPO-based anchor gradients, our method explicitly clips the components of SFT gradients that conflict with GRPO. In addition, we establish the AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to simulate and evaluate continual GUI learning. Experimental results demonstrate the effectiveness of the proposed CGL framework across continual learning scenarios. The benchmark, code, and model will be made publicly available.
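The gradient surgery step described in the abstract can be illustrated with a minimal sketch. It assumes a PCGrad-style projection: when the SFT gradient has a negative inner product with the GRPO anchor gradient, the opposing component is subtracted, otherwise the SFT gradient is left unchanged. The function names, the flat-list gradient representation, and the way the two gradients are summed with an SFT weight `lam` are illustrative assumptions, not the authors' released implementation.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two flat gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflicting(
    g_sft: Sequence[float], g_anchor: Sequence[float], eps: float = 1e-12
) -> List[float]:
    """Remove from g_sft the component that opposes g_anchor.

    Hypothetical helper: if the two gradients do not conflict
    (non-negative inner product), g_sft is returned unchanged.
    """
    inner = dot(g_sft, g_anchor)
    if inner >= 0.0:
        return list(g_sft)
    # Coefficient of the projection of g_sft onto g_anchor (negative here).
    scale = inner / (dot(g_anchor, g_anchor) + eps)
    return [s - scale * a for s, a in zip(g_sft, g_anchor)]


def combined_update(
    g_grpo: Sequence[float], g_sft: Sequence[float], lam: float
) -> List[float]:
    """Assumed combination rule: GRPO gradient plus lam times the projected SFT gradient."""
    g_sft_proj = project_conflicting(g_sft, g_grpo)
    return [a + lam * s for a, s in zip(g_grpo, g_sft_proj)]


if __name__ == "__main__":
    g_grpo = [0.0, 1.0]   # anchor gradient
    g_sft = [1.0, -1.0]   # conflicts with the anchor along the second axis
    print(project_conflicting(g_sft, g_grpo))
    print(combined_update(g_grpo, g_sft, lam=0.5))
```

After projection, the SFT gradient is orthogonal to the anchor whenever a conflict was detected, so adding it cannot reduce progress along the GRPO direction to first order.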

Paper Structure

This paper contains 41 sections, 3 theorems, 16 equations, 27 figures, 13 tables.

Key Result

Lemma 1

The first-order approximation of the change in policy entropy $\Delta \mathcal{H}$ induced by a logit update vector $\Delta \mathbf{z}_s$ is determined by the negative covariance between the log-probabilities and the update values: $\Delta \mathcal{H} \approx -\eta \cdot \mathrm{Cov}_{a \sim \pi(\cdot \mid s)}\left(\log \pi(a \mid s),\, \Delta z_{s,a}\right)$, where $\eta$ is the learning rate.
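A quick numerical check can make the lemma concrete. The sketch below assumes a softmax policy over a single state's logits and reads the lemma as $\Delta \mathcal{H} \approx -\eta \cdot \mathrm{Cov}_{a \sim \pi}(\log \pi(a \mid s), \Delta z_{s,a})$ for the update $\mathbf{z}_s \leftarrow \mathbf{z}_s + \eta \Delta \mathbf{z}_s$. That reading, the function names, and the example logits are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def predicted_entropy_change(
    z: Sequence[float], dz: Sequence[float], eta: float
) -> float:
    """First-order estimate: -eta * Cov_{a~pi}(log pi(a), dz_a)."""
    p = softmax(z)
    logp = [math.log(pi) for pi in p]
    mean_logp = sum(pi * l for pi, l in zip(p, logp))
    mean_dz = sum(pi * d for pi, d in zip(p, dz))
    cov = sum(
        pi * (l - mean_logp) * (d - mean_dz) for pi, l, d in zip(p, logp, dz)
    )
    return -eta * cov


def actual_entropy_change(
    z: Sequence[float], dz: Sequence[float], eta: float
) -> float:
    """Exact entropy difference after applying z + eta * dz."""
    z_new = [a + eta * b for a, b in zip(z, dz)]
    return entropy(softmax(z_new)) - entropy(softmax(z))


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]   # hypothetical logits for one state
    update = [1.0, 0.0, 0.0]    # push up the already-likely action
    eta = 1e-3
    print(predicted_entropy_change(logits, update, eta))
    print(actual_entropy_change(logits, update, eta))
```

In this example the update raises the logit of the most probable action, so the covariance is positive and both the estimate and the exact difference are negative: reinforcing an already-confident action lowers entropy.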

Figures (27)

  • Figure 1: Overview of Continual GUI Learning (CGL). (a) GUI environments evolve as new app categories emerge, causing static GUI agents to misalign with the real world. (b) Existing paradigms struggle to balance stability in legacy tasks and plasticity in novel task accuracy. (c) Our CGL framework achieves balanced adaptation and retention.
  • Figure 2: Preliminary comparison of SFT and GRPO on the LLaVA-OneVision-0.5B model. (a) Adaptation performance on a newly introduced task: SFT demonstrates fast plasticity, whereas GRPO exhibits slower adaptation. (b) Forgetting of the initial task after sequential training: SFT undergoes substantial degradation, while GRPO maintains higher retention.
  • Figure 3: Overview of the proposed CGL framework. (Left) Error-Aware Routing pipeline: It dynamically routes data based on prediction feedback, where erroneous actions trigger SFT knowledge injection to facilitate correction. (Top Right) Gradient Surgery: Resolves directional conflicts between GRPO and SFT gradients through projection to maintain optimization stability. (Bottom Right) Entropy-Regulated Tuning: A dynamic strategy that manages the exploration-exploitation trade-off by adjusting the SFT weight $\lambda$. The bottom bar charts illustrate the model’s action probability distribution $p$, where the yellow shaded area denotes the ground truth ($gt$) action range.
  • Figure 4: Trends of policy entropy (top) and SFT loss weight $\lambda$ (bottom) across training steps for Entropy-Regulated Tuning.
  • Figure 5: High-accuracy region comparison of GRPO and our CGL in continual GUI learning.
  • ...and 22 more figures

Theorems & Definitions (6)

  • Lemma 1: Entropy-Covariance Relationship
  • proof
  • Lemma 2: GRPO Update Dynamics
  • proof
  • Lemma 3: SFT Entropy Injection
  • proof