SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Chunkai Fan, Junyu Lu, Yulin Luo, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang

TL;DR

SEEA-R1 tackles the challenge of autonomous self-evolution in embodied agents by marrying reinforcement fine-tuning with self-generated signals. It introduces Tree-GRPO, which integrates Monte Carlo Tree Search into Group Relative Policy Optimization to densify sparse rewards across multi-step trajectories, and MGRM, a multimodal reward model that generalizes across tasks and environments. Through a closed-loop Data Evolution and Model Evolution framework, SEEA-R1 iteratively improves both policy and reward modeling, achieving state-of-the-art results on the ALFWorld benchmark under both supervised and self-supervised settings, and demonstrating notable real-world transfer through reflection-correction capabilities. The work highlights the potential of self-evolving embodied intelligence for scalable planning and reasoning in complex, long-horizon tasks, with practical implications for autonomous agents operating in diverse real-world environments.

Abstract

Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential in the embodied domain, where tasks are long-horizon and set in the real world. Although recent advances in reinforcement fine-tuning (RFT) have shown strong performance in enhancing reasoning in LLMs, its potential to enable self-evolving embodied intelligence with multi-modal interactions remains largely unexplored. In particular, reinforcement fine-tuning faces two fundamental obstacles in embodied settings: (i) the lack of accessible intermediate rewards in multi-step reasoning tasks limits effective learning signals, and (ii) reliance on hand-crafted reward functions restricts generalization to novel tasks and environments. To address these challenges, we present Self-Evolving Embodied Agents-R1 (SEEA-R1), the first RFT framework designed to enable the self-evolving capabilities of embodied agents. To convert sparse delayed rewards into denser intermediate signals that improve multi-step reasoning, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), which integrates Monte Carlo Tree Search into GRPO. To generalize reward estimation across tasks and scenes, supporting autonomous adaptation and reward-driven self-evolution, we further introduce the Multi-modal Generative Reward Model (MGRM). We evaluate SEEA-R1 on the ALFWorld benchmark, where it surpasses state-of-the-art methods with scores of 85.07% (textual) and 46.27% (multi-modal), outperforming prior models including GPT-4o. SEEA-R1 also achieves scores of 80.3% (textual) and 44.03% (multi-modal) without ground truth reward, surpassing all open-source baselines and highlighting its scalability as a self-evolving embodied agent. Additional experiments and qualitative analysis further support the potential of SEEA-R1 for future research in scalable embodied intelligence.
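
For intuition, the group-relative normalization that GRPO is known for can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: treating the MCTS-derived Q-values of sibling actions as the "group" is inferred from the abstract's description, and the function name and the epsilon constant are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(q_values, eps=1e-8):
    """Normalize a group of scores to zero mean and roughly unit scale.

    Assumption: the group is the set of candidate actions expanded from
    the same tree node, scored by their backed-up Q-values.
    """
    mu = mean(q_values)
    sigma = pstdev(q_values)  # population standard deviation of the group
    # eps keeps the division finite when every candidate has the same score
    return [(q - mu) / (sigma + eps) for q in q_values]


# Hypothetical Q-values for four sibling actions at one node
sibling_q = [0.9, 0.1, 0.5, 0.5]
advantages = group_relative_advantages(sibling_q)
```

Actions scoring above the group mean receive positive advantages and those below receive negative ones, so a step in the middle of a trajectory gets a learning signal even when the environment only rewards the final outcome.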

Paper Structure

This paper contains 77 sections, 3 equations, 16 figures, 10 tables, 1 algorithm.

Figures (16)

  • Figure 1: SEEA-R1 self-evolves by reasoning over its environment with perception-grounded planning. The agent explores task solutions using tree-based search guided by a reward model, iteratively refining actions to achieve complex goals. Given a high-level instruction, it explores, plans, and executes actions in an embodied environment.
  • Figure 2: SEEA-R1 framework. The framework drives continuous improvement through an iterative loop of two core cycles. (1) Data Evolution: The Policy Model interacts with the environment via MCTS from an initial state to generate the experience dataset, containing trajectories with derived Q-values, ground truth rewards from the environment, and rewards from the current Reward Model. (2) Model Evolution: The collected data is used to update both models: (a) the Policy Model to predict actions and (b) the Reward Model to predict categorical outcomes. Refined models from Model Evolution then drive the next Data Evolution iteration, enabling continuous self-evolution.
  • Figure 3: Monte Carlo Tree Search (MCTS) in SEEA-R1. (a) Selection: Traverse the tree via the UCT criterion until reaching a leaf. (b) Expansion: Execute the action, observe the result, and expand with new actions. (c) Simulation: Roll out from the new node to termination or the depth limit, collecting reward $r$. (d) Backup: Propagate rewards to update action values $Q$ using the paper's MCTS backup equation.
  • Figure 4: Visualization of SEEA-R1 executing the "put clothes into the washing machine" task in real-world settings, which demonstrates reflection-correction capability.
  • Figure 5: Performance comparison of SEEA-R1 using different optimization algorithms on the multi-modal scenario of the ALFWorld benchmark over training iterations. More detailed figures are provided in the paper's appendix.
  • ...and 11 more figures