Few-Shot Vision-Language Action-Incremental Policy Learning

Mingchen Song, Xiang Deng, Guoqiang Zhong, Qi Lv, Jia Wan, Yinchuan Li, Jianye Hao, Weili Guan

TL;DR

This work tackles data-scarce robotic manipulation by formulating Few-Shot Action-Incremental Learning (FSAIL) and introducing TOPIC, an add-on for Transformer-based policies that combines Task-Specific Prompts (TSP) with a Continuous Evolution Strategy (CES). TSP fuses multi-modal information from a few demonstrations to extract task-discriminative signals, while CES builds a task relation graph to reuse learned skills and mitigate forgetting during continual learning. The method is evaluated on RLBench with 10 base tasks and 5 incremental tasks (1-shot and 5-shot), where it outperforms state-of-the-art Transformer baselines by up to ~28 percentage points in success rate and also outperforms classical continual-learning methods. Real-world experiments on a Cobot Mobile ALOHA support the approach’s practical viability, showing improved continual adaptation from limited demonstrations and reduced catastrophic forgetting.

Abstract

Recent Transformer-based robotic manipulation methods use multi-view spatial representations and language instructions to learn robot motion trajectories from large numbers of robot demonstrations. However, collecting robot data is extremely challenging, and existing methods cannot continually learn new tasks from only a few demonstrations. In this paper, we formulate these challenges as the Few-Shot Action-Incremental Learning (FSAIL) task, and accordingly design a Task-prOmpt graPh evolutIon poliCy (TOPIC) to address them. Specifically, to address data scarcity in robotic imitation learning, TOPIC learns Task-Specific Prompts (TSP) through the deep interaction of multi-modal information within few-shot demonstrations, thereby extracting task-specific discriminative information. To enhance continual learning on new tasks and mitigate catastrophic forgetting, TOPIC adopts a Continuous Evolution Strategy (CES). CES leverages the intrinsic relationships between tasks to construct a task relation graph, which facilitates adaptation to new tasks by reusing skills learned from previous tasks. TOPIC pioneers few-shot continual learning in robotic manipulation, and extensive experimental results demonstrate that TOPIC outperforms state-of-the-art baselines by over 26% in success rate, significantly enhancing the continual learning capabilities of existing Transformer-based policies.
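The abstract does not spell out how the task relation graph is built or how skills are reused, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each task is summarized by a prompt embedding, that graph edges are cosine similarities between embeddings, and that a new task's parameters are initialized as a similarity-weighted mix of earlier tasks' parameters. The function names, the `temperature` parameter, and the task names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relation_weights(new_prompt, old_prompts, temperature=0.1):
    """Edge weights from a new task to each earlier task.

    Assumption: edges are a softmax over cosine similarities of
    task prompt embeddings. The paper may define them differently.
    """
    sims = {name: cosine(new_prompt, p) for name, p in old_prompts.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - top) / temperature) for name, s in sims.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def reuse_skills(edge_weights, old_params):
    """Initialize a new task's parameters as a weighted mix of old ones."""
    dim = len(next(iter(old_params.values())))
    return [
        sum(edge_weights[name] * old_params[name][i] for name in edge_weights)
        for i in range(dim)
    ]


# Hypothetical prompt embeddings and parameters for two earlier tasks.
old_prompts = {"task_a": [1.0, 0.0], "task_b": [0.0, 1.0]}
old_params = {"task_a": [1.0, 2.0], "task_b": [3.0, 4.0]}

weights = relation_weights([0.9, 0.1], old_prompts)
init = reuse_skills(weights, old_params)
```

In this toy setup the new task's embedding lies close to `task_a`, so nearly all of the edge weight goes to `task_a` and the initialization stays near its parameters. A lower `temperature` sharpens that preference, while a higher one blends earlier tasks more evenly.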

Paper Structure

This paper contains 34 sections, 12 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Illustration of our proposed Task-prOmpt graPh evolutIon poliCy (TOPIC) for FSAIL. We learn task-specific prompts and construct a task relation graph with few-shot demonstrations. Through a continuous evolution strategy, TOPIC adapts policy weights based on the intrinsic relationships between different tasks.
  • Figure 2: Comparisons of different Transformer-based policies and our proposed TOPIC. (a): Transformer-based policies include a series of methods such as RVT, RVT2, SAM-E, and others. (b): Our proposed TOPIC, which can be flexibly integrated with other Transformer-based policies to enhance their continual learning capability with few-shot demonstrations.
  • Figure 3: The structure of our proposed Task-Specific Prompts (TSP) involves a set of predefined learnable prompt vectors, which interact deeply with information from other modalities through the Multi-View Transformer Encoder. TSP extracts task-specific discriminative information within a few demonstrations.
  • Figure 4: FSAIL Tasks in RLBench. We design 10 tasks in the base session and 5 tasks in the incremental session to validate the model's continual learning capability with novel objects and actions.
  • Figure 5: Exploring the impact of the number of task-specific prompts. We report the average accuracy across all sessions.
  • ...and 4 more figures
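The Figure 3 caption describes learnable prompt vectors interacting with other modalities inside a Multi-View Transformer Encoder. As a rough illustration only, the sketch below prepends prompt tokens to view and language tokens, runs one self-attention pass over the joint sequence, and reads out the prompt positions. The single-head attention with identity projections is a simplification standing in for the real encoder, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption: a real encoder would use learned projections,
    multiple heads, and several stacked layers.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    mixed = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        attn = softmax(scores)
        mixed.append(
            [sum(w * value[i] for w, value in zip(attn, tokens)) for i in range(dim)]
        )
    return mixed


def encode_with_prompts(prompts, view_tokens, language_tokens):
    """Return prompt tokens after they attend to the other modalities."""
    sequence = prompts + view_tokens + language_tokens
    mixed = self_attention(sequence)
    return mixed[: len(prompts)]  # keep only the prompt positions


# Toy 2-D tokens: two prompts, one view token, one language token.
prompts = [[0.1, 0.0], [0.0, 0.1]]
updated = encode_with_prompts(prompts, [[1.0, 0.0]], [[0.0, 1.0]])
```

Each updated prompt is a convex combination of every token in the sequence, which is how information from the visual and language tokens ends up in the prompt positions.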