K^2-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control

Zhe Wu, Donglin Mo, Hongjin Lu, Junliang Xing, Jianheng Liu, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Yuanchun Shi

TL;DR

K2-Agent is a hierarchical framework that models human-like cognition by separating and co-evolving declarative and procedural knowledge for planning and execution.

Abstract

Existing mobile device control agents often perform poorly when solving complex tasks requiring long-horizon planning and precise operations, typically due to a lack of relevant task experience or unfamiliarity with skill execution. We propose K2-Agent, a hierarchical framework that models human-like cognition by separating and co-evolving declarative (knowing what) and procedural (knowing how) knowledge for planning and execution. K2-Agent's high-level reasoner is bootstrapped from a single demonstration per task and runs a Summarize-Reflect-Locate-Revise (SRLR) loop to distill and iteratively refine task-level declarative knowledge through self-evolution. The low-level executor is trained with our curriculum-guided Group Relative Policy Optimization (C-GRPO), which (i) constructs a balanced sample pool using decoupled reward signals and (ii) employs dynamic demonstration injection to guide the model in autonomously generating successful trajectories for training. On the challenging AndroidWorld benchmark, K2-Agent achieves a 76.1% success rate using only raw screenshots and open-source backbones. Furthermore, K2-Agent shows powerful dual generalization: its high-level declarative knowledge transfers across diverse base models, while its low-level procedural skills achieve competitive performance on unseen tasks in ScreenSpot-v2 and Android-in-the-Wild (AitW).
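
The abstract does not spell out the C-GRPO update, but the group-relative advantage that standard GRPO builds on can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are all assumptions made for the example. It also hints at one plausible reading of why demonstration injection could matter: when every rollout in a group receives the same reward, all advantages collapse to zero and the group carries no learning signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts sampled for the same prompt.

    Hypothetical helper: standard GRPO normalizes rewards within a group;
    the exact normalization used by C-GRPO is not stated in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group yields positive advantages for successes, negative for failures.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails yields all-zero advantages.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this sketch, a group of uniformly failed rollouts contributes nothing to the policy gradient, which is consistent with the abstract's motivation for guiding the model toward successful trajectories.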

Paper Structure (25 sections, 8 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: K²-Agent's co-evolutionary learning curve on AndroidWorld. The main curve shows the agent's success rate steadily improving. Ablations (lower curves) confirm the contribution of key components, and subplots below illustrate the expanding mastery over new apps and tasks.
  • Figure 2: Overview of the K²-Agent. Top: The SRLR loop where declarative knowledge (knowing what) is iteratively improved using feedback. Bottom: The skill acquisition process where procedural knowledge (knowing how) is learned via C-GRPO, bootstrapped from a single demonstration.
  • Figure 3: An illustration of the SRLR self-evolution loop. (1) Summarize: An initial knowledge base is automatically distilled from a demonstration. (2) Reflect: The agent analyzes its execution trace to identify deviations. (3) Locate: The failure's root cause is pinpointed. (4) Revise: Atomic operators repair the knowledge base for the next iteration.
  • Figure 4: Our C-GRPO framework, featuring its two main curriculum components: Error-Decoupled Replay Balancing (left) to construct balanced mini-batches, and Dynamic Demonstration Injection (center) to provide adaptive guidance for the GRPO update (right).
  • Figure 5: Ablation and Component Analysis. (a): The declarative knowledge from our SRLR loop provides a substantial performance boost across four different powerful VLM backbones. (b): Training reward curves for the low-level executor comparing full C-GRPO (blue), C-GRPO without replay balancing (red), and C-GRPO without demonstration injection (green).
  • ...and 1 more figures
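
The four SRLR stages named in the Figure 3 caption can be read as a simple control loop. The sketch below is only an illustration of that control flow under stated assumptions: every callable (`summarize`, `run_task`, `reflect`, `locate`, `revise`), the `max_iters` cap, the stop-on-success rule, and the list-of-strings knowledge base are hypothetical stand-ins, since the source does not specify interfaces or stopping criteria.

```python
def srlr_loop(demonstration, summarize, run_task, reflect, locate, revise, max_iters=5):
    """Hypothetical skeleton of a Summarize-Reflect-Locate-Revise loop."""
    # (1) Summarize: distill an initial knowledge base from one demonstration.
    knowledge = summarize(demonstration)
    for _ in range(max_iters):
        trace, success = run_task(knowledge)
        if success:  # assumed stopping rule, not taken from the paper
            break
        # (2) Reflect: compare the execution trace with the knowledge base.
        deviations = reflect(trace, knowledge)
        # (3) Locate: pick out the root cause among the deviations.
        root_cause = locate(deviations, knowledge)
        # (4) Revise: repair the knowledge base for the next iteration.
        knowledge = revise(knowledge, root_cause)
    return knowledge


# Toy stand-ins so the skeleton runs end to end (all invented for illustration).
def toy_run_task(knowledge):
    return list(knowledge), "confirm dialog" in knowledge


result = srlr_loop(
    demonstration=["open app", "tap save"],
    summarize=list,
    run_task=toy_run_task,
    reflect=lambda trace, knowledge: ["confirm dialog"],
    locate=lambda deviations, knowledge: deviations[0],
    revise=lambda knowledge, cause: knowledge + [cause],
)
```

In the toy run, the first attempt fails because the knowledge base lacks the confirmation step, one revision adds it, and the second attempt succeeds.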