Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang

TL;DR

CrossAgent introduces a unified agent capable of mastering heterogeneous action spaces and autonomously selecting the most effective interface for each step in a trajectory. The method uses a three-stage pipeline—cold-start mixed-space SFT, Single-Turn RL with GRPO, and Multi-Turn RL (MTRL) enabled by self-training initialization—to learn dynamic action-space switching. Evaluated on the OpenHA Minecraft benchmark with over 800 tasks, CrossAgent achieves state-of-the-art performance and strong generalization, substantially outperforming fixed-space baselines. This work advances open-world, generalist agents by showing that learnable, context-aware interface switching can achieve both efficiency and robustness in long-horizon reasoning.

Abstract

The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces--such as exclusively using APIs, GUI events, or robotic commands. This rigidity limits their adaptability in dynamic environments where the optimal granularity of interaction varies contextually. To bridge this gap, we propose CrossAgent, a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. We introduce a comprehensive training pipeline that integrates cold-start supervised fine-tuning with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm. This approach enables the agent to learn adaptive action switching--balancing high-level efficiency with low-level precision--without human-specified rules. Extensive experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossAgent achieves state-of-the-art performance. By dynamically leveraging the strengths of diverse action spaces, our model significantly outperforms fixed-action baselines, exhibiting superior generalization and efficiency in long-horizon reasoning. All code and models are available at https://github.com/CraftJarvis/OpenHA
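As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same task. It is a minimal sketch under stated assumptions, not the paper's Multi-Turn GRPO: the function name `group_relative_advantages`, the binary success reward, and the use of the population standard deviation are all illustrative choices that do not come from this abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    rewards: scalar rewards for rollouts of the same task (hypothetical values).
    eps: small constant that avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts of one task, reward 1.0 on success.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages, so no separate value network is needed to form a baseline. How the multi-turn variant assigns these values across the steps of a trajectory is not specified in this section.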

Paper Structure

This paper contains 50 sections, 10 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The CrossAgent Framework. Unlike prior methods that confine the agent to a fixed action space (e.g., atomic movements) throughout a trajectory, CrossAgent dynamically switches across different action spaces to adapt to the context.
  • Figure 2: Overview of the CrossAgent Training Pipeline. The pipeline comprises three distinct stages: Cold-Start Supervised Fine-Tuning (SFT), Single-Turn Reinforcement Learning (STRL), and Multi-Turn Reinforcement Learning (MTRL). In the first stage, the model learns to decode actions from a heterogeneous action space using a balanced dataset. During STRL, the model is fine-tuned to autonomously select the appropriate action space based on the immediate task context. Finally, in the MTRL stage, the policy is further optimized to balance task success rate with execution efficiency over long horizons. This progressive pipeline ensures CrossAgent effectively adapts its action granularity across a wide range of tasks.
  • Figure 3: Performance Comparison Across Action Spaces. The heterogeneous action space of CrossAgent enables superior data efficiency and higher asymptotic performance during multi-turn reinforcement learning, compared to single-space baselines.
  • Figure 4: Effect of the Single-Turn RL (STRL) Stage. Training curves comparing CrossAgent with and without the STRL phase. The inclusion of STRL significantly enhances training efficiency and accelerates convergence in the subsequent MTRL stage, despite its low computational cost.
  • Figure 5: Case Study: Action distribution during the Kill Sheep, Chop Tree, and Craft Enchanting tasks. For each task, density curves aggregated over 20 episodes show how often each action space (Motion, Grounding, Raw) is used across task phases. The dynamic shifts in distribution demonstrate the model's in-context adaptive strategy.
  • ...and 2 more figures
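
The per-phase action distribution described in the Figure 5 caption can be approximated with a simple tally. The sketch below is a coarse, binned stand-in for the density curves, written under assumptions: the function name `action_space_shares`, the equal-width split of each episode into phases, and the lowercase labels are hypothetical, and only the three action-space names (Motion, Grounding, Raw) come from the caption.

```python
from collections import Counter


def action_space_shares(episodes, num_phases=3):
    """Share of each action space within each phase, pooled over episodes.

    episodes: list of episodes, each a list of action-space labels per step.
    num_phases: number of equal-width phases per episode (an assumption).
    """
    counts = [Counter() for _ in range(num_phases)]
    for steps in episodes:
        n = len(steps)
        for t, space in enumerate(steps):
            # Map step index t to a phase index in [0, num_phases - 1].
            phase = min(t * num_phases // n, num_phases - 1)
            counts[phase][space] += 1

    shares = []
    for c in counts:
        total = sum(c.values())
        shares.append({s: k / total for s, k in c.items()} if total else {})
    return shares


# Hypothetical episode: motion first, then grounding, then raw actions.
demo = [["motion", "motion", "grounding", "grounding", "raw", "raw"]]
shares = action_space_shares(demo, num_phases=3)
```

A shift in the dominant label from one phase to the next is the kind of pattern the caption attributes to the model's in-context adaptive strategy.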