Never-Ending Behavior-Cloning Agent for Robotic Manipulation

Wenqi Liang, Gan Sun, Yao He, Yu Ren, Jiahua Dong, Yang Cong

TL;DR

Robotic manipulation in open-world settings demands continual learning with robust 3D scene understanding. The authors propose NBAgent, a Never-ending Behavior-cloning Agent, with three key components: a skill-shared semantic rendering module (SSR) that transfers 3D scene semantics via NeRF-guided supervision, a skill-shared representation distillation module (SRD) for cross-task representation distillation, and a skill-specific evolving planner (SEP) that captures skill-specific knowledge via a dynamic latent space and LoRA adapters. The approach is evaluated on RLBench and a never-ending benchmark with Kitchen and Living Room tasks, showing state-of-the-art performance and improved resistance to forgetting compared to strong baselines. By separating skill-shared from skill-specific knowledge, this work advances open-ended, language-conditioned manipulation and demonstrates practical potential for continual robotic learning in 3D-rich environments.

Abstract

Relying on multi-modal observations, embodied robots (e.g., humanoid robots) can perform multiple robotic manipulation tasks in unstructured real-world environments. However, most language-conditioned behavior-cloning agents still face two long-standing challenges, namely 3D scene representation and human-level task learning, when adapting to a series of new tasks in practical scenarios. We investigate these challenges with NBAgent, a pioneering language-conditioned Never-ending Behavior-cloning Agent for embodied robots, which continually learns observation knowledge of novel 3D scene semantics and robot manipulation skills from skill-shared and skill-specific attributes, respectively. Specifically, we propose a skill-shared semantic rendering module and a skill-shared representation distillation module to effectively learn 3D scene semantics from skill-shared attributes, thereby tackling the overlooking of 3D scene representation. Meanwhile, we establish a skill-specific evolving planner to decouple manipulation knowledge, which continually embeds novel skill-specific knowledge, as a human would, in latent and low-rank spaces. Finally, we design a never-ending embodied robot manipulation benchmark, and extensive experiments demonstrate the significant performance of our method.
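
The "low-rank space" for skill-specific knowledge can be made concrete with a generic low-rank (LoRA-style) adapter. The following is a minimal sketch of that general technique, not the NBAgent implementation: the class name `LoRALinear`, the skill names, the toy weight matrix, and the rank and scale values are all hypothetical and chosen only for illustration. A frozen shared weight `W` is combined with a small per-skill update `B @ A`, so a new skill adds parameters without overwriting the shared ones.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update scale * B @ A.

    Hypothetical illustration of a low-rank adapter; not the paper's code.
    """

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(W), len(W[0])
        self.W = W  # shared weight, never modified by the adapter
        # A is small random, B is zero, so the update starts at exactly zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # shared path
        delta = matvec(self.B, matvec(self.A, x))   # skill-specific path
        return [b + self.scale * d for b, d in zip(base, delta)]


# One adapter per skill, all sharing the same frozen W (assumed toy values).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
adapters = {
    "skill_a": LoRALinear(W, rank=1, seed=0),
    "skill_b": LoRALinear(W, rank=1, seed=1),
}

x = [1.0, 2.0, 3.0]
y_before = adapters["skill_a"].forward(x)

# Simulate "learning" skill_a by changing only its adapter matrix B.
adapters["skill_a"].B = [[1.0], [0.0]]
y_after = adapters["skill_a"].forward(x)
y_other = adapters["skill_b"].forward(x)
```

Because `B` starts at zero, each adapter initially reproduces the shared layer's output. Updating one skill's adapter changes only that skill's output, while the shared weight and the other skill's output stay untouched, which is the property that makes this kind of decoupling attractive for limiting forgetting.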

Paper Structure

This paper contains 16 sections, 13 equations, 9 figures, 13 tables, 2 algorithms.

Figures (9)

  • Figure 1: Illustration of the proposed never-ending behavior-cloning robot learning. As illustrated in (a), behavior-cloning robot learning primarily trains once on a fixed dataset and then relies on generalization to execute tasks in unseen environments, where a pre-trained CLIP model (Radford et al., 2021) serves as the language encoder to process the language instruction. As depicted in (b), the never-ending behavior-cloning framework enables robotic systems to progressively acquire novel manipulation skills in a continual learning manner, thereby demonstrating enhanced adaptability and generalization when confronted with unseen and challenging tasks.
  • Figure 2: Overview of the proposed NBAgent. It consists of a skill-shared semantic rendering module and a skill-shared representation distillation loss $\mathcal{L}_{\mathrm{SRD}}$, which transfer skill-shared knowledge about 3D scene semantics and counter the overlooking of 3D reasoning in continual learning, and a skill-specific evolving planner that learns skill-specific knowledge to address catastrophic forgetting of learned skills.
  • Figure 3: Prediction examples on RLBench (James et al., 2020). For qualitative evaluation, we visualize three key frames from each manipulation task across all methods.
  • Figure 4: Visualization of manipulation skills in the Kitchen dataset, consisting of 10 manipulation skills pertinent to kitchen environments.
  • Figure 5: Visualization of manipulation skills in the Living Room dataset, including 12 manipulation skills associated with living room scenarios.
  • ...and 4 more figures