Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own

Weirui Ye, Yunsheng Zhang, Haoyang Weng, Xianfan Gu, Shengjie Wang, Tong Zhang, Mengchen Wang, Pieter Abbeel, Yang Gao

TL;DR

The paper tackles data inefficiency and reward design challenges in reinforcement learning for robotics by introducing Reinforcement Learning with Foundation Priors (RLFP). RLFP draws on three foundation priors (policy, value, and success-reward) to guide learning. It is instantiated as the Foundation-guided Actor-Critic (FAC) algorithm, which combines imitation of successful trajectories, policy regularization toward the prior policy, and reward shaping from the value prior. On real Franka manipulation tasks, FAC reaches an average success rate of 86% after one hour of learning. In simulated Meta-World, it reaches 100% success on 7 of 8 tasks in fewer than 100k frames, outperforming baselines trained with manually designed rewards for 1M frames. The authors report that the approach is robust to noisy priors and agnostic to the form of the foundation models, which suggests growing potential for autonomous real-world robot learning as foundation models improve.

Abstract

Reinforcement learning (RL) is a promising approach for solving robotic manipulation tasks. However, it is challenging to apply RL algorithms directly in the real world. For one thing, RL is data-intensive and typically requires millions of interactions with the environment, which is impractical in real scenarios. For another, designing reward functions manually requires heavy engineering effort. To address these issues, we leverage foundation models in this paper. We propose Reinforcement Learning with Foundation Priors (RLFP) to utilize guidance and feedback from policy, value, and success-reward foundation models. Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions. The benefits of our framework are threefold: (1) sample efficient; (2) minimal and effective reward engineering; (3) agnostic to foundation model forms and robust to noisy priors. Our method achieves strong performance in various manipulation tasks on both real robots and in simulation. Across 5 dexterous tasks with real robots, FAC achieves an average success rate of 86% after one hour of real-time learning. Across 8 tasks in the simulated Meta-World, FAC achieves 100% success rates in 7/8 tasks in fewer than 100k frames (about 1 hour of training), outperforming baseline methods that use manually designed rewards and 1M frames. We believe the RLFP framework can enable future robots to explore and learn autonomously in the physical world for more tasks. Visualizations and code are available at https://yewr.github.io/rlfp.

Paper Structure (21 sections, 3 theorems, 15 equations, 13 figures, 6 tables)

Key Result

Theorem 1

(Reward shaping.) Suppose that $F$ takes the form $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$, with $\Phi(s_0) = 0$ if $\gamma = 1$. Then for all $s \in \mathcal{S}, a \in \mathcal{A}$, the potential-based $F$ preserves optimal policies, and we have:
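The theorem statement is cut off at this point. As an illustration only, the standard form of the potential-based shaping result (Ng, Harada, and Russell, 1999) says that the optimal Q-function of the shaped problem equals the original one minus $\Phi(s)$, so the greedy policy is unchanged; that the paper states it in exactly this form is an assumption. The sketch below checks this numerically on a hypothetical three-state MDP. The transition table, the potential `PHI`, and all function names are made up for the example and do not come from the paper.

```python
GAMMA = 0.9
ACTIONS = [0, 1]
# Deterministic toy MDP (hypothetical): NEXT[s][a] is the successor state.
# State 2 is an absorbing goal state.
NEXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 2}, 2: {0: 2, 1: 2}}
# Arbitrary potential; it stands in for a (possibly noisy) value prior.
PHI = {0: 0.2, 1: 0.7, 2: 1.5}


def base_reward(s, a, s_next):
    """Sparse reward: 1 only on the transition from state 1 into the goal."""
    return 1.0 if (s == 1 and s_next == 2) else 0.0


def shaped_reward(s, a, s_next):
    """Base reward plus F(s, a, s') = gamma * Phi(s') - Phi(s)."""
    return base_reward(s, a, s_next) + GAMMA * PHI[s_next] - PHI[s]


def q_value_iteration(reward_fn, iters=500):
    """Synchronous Q-value iteration on the deterministic toy MDP."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in NEXT}
    for _ in range(iters):
        q = {
            s: {
                a: reward_fn(s, a, NEXT[s][a])
                + GAMMA * max(q[NEXT[s][a]].values())
                for a in ACTIONS
            }
            for s in NEXT
        }
    return q


def greedy(q, states=(0, 1)):
    """Greedy action in each non-goal state."""
    return {s: max(ACTIONS, key=lambda a: q[s][a]) for s in states}


q_base = q_value_iteration(base_reward)
q_shaped = q_value_iteration(shaped_reward)

# Largest deviation from the identity Q_shaped(s, a) = Q_base(s, a) - Phi(s).
max_gap = max(
    abs(q_shaped[s][a] - (q_base[s][a] - PHI[s]))
    for s in NEXT
    for a in ACTIONS
)
```

In this toy problem `greedy(q_base)` and `greedy(q_shaped)` select the same actions, and `max_gap` is zero up to floating-point error, even though `PHI` was chosen arbitrarily. This is the sense in which shaping from an imperfect value prior does not change which policy is optimal.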

Figures (13)

  • Figure 1: An example of how a human solves tasks using policy, value, and success-reward prior knowledge. The proposed Reinforcement Learning with Foundation Priors framework utilizes the corresponding foundation models to acquire prior knowledge.
  • Figure 2: The overview of Foundation-guided Actor-Critic. In FAC, rewards are derived from foundation success rewards and value shaping. Besides policy gradient updates, the actor is trained using prior policy regularization and success trajectory imitation.
  • Figure 3: Five tasks on real robots, demonstrating the efficiency and accuracy of FAC in the real world.
  • Figure 4: During training, the agent progressively favors actions from the actor, reducing reliance on the prior policy.
  • Figure 5: The prior policy attempts to open the door without a successful grasp, whereas FAC persistently tries to secure the handle before pulling back the arm.
  • ...and 8 more figures
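
The Figure 2 caption says FAC rewards are derived from foundation success rewards and value shaping. A minimal sketch of one way to compose such a reward is given below, assuming the shaping term takes the potential-based form from Theorem 1 with the value prior as the potential. The function name `fac_reward`, its arguments, and the unweighted sum are hypothetical; the paper's actual weighting is not shown in this summary.

```python
def fac_reward(success, value_s, value_next, gamma=0.99):
    """Hypothetical FAC-style reward for one transition.

    success    : output of a success-reward prior (e.g. 0.0 or 1.0)
    value_s    : value prior evaluated at the current state
    value_next : value prior evaluated at the next state
    """
    shaping = gamma * value_next - value_s  # potential-based shaping term
    return success + shaping


# Example transition where the value prior rises from 0.2 to 0.5
# and the task has not yet succeeded.
example = fac_reward(success=0.0, value_s=0.2, value_next=0.5, gamma=0.9)
```

In this example the reward is positive although the success signal is zero, because the value prior judges the next state to be closer to completing the task.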

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • Proof 1
  • Theorem 2
  • Proof 2