Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation

Teli Ma, Jiaming Zhou, Zifan Wang, Ronghe Qiu, Junwei Liang

TL;DR

Sigma-Agent introduces contrastive imitation learning to align vision-language and current-future representations for language-guided multi-task robotic manipulation. By integrating an MVQ-Former to efficiently fuse multi-view RGB-D data and freezing the language encoder while applying contrastive losses to refine both feature extraction and vision-language interaction, the method achieves state-of-the-art performance on RLBench across 18 tasks and demonstrates practical real-world capabilities with a single policy. The approach generalizes to existing baselines by plugging in the contrastive IL module, underscoring its utility for enhancing multi-modal perception and control in robotics. Overall, the work offers a scalable, end-to-end framework for discriminative, language-conditioned manipulation with clear improvements in sample efficiency and task differentiation.

Abstract

Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present Sigma-Agent, an end-to-end imitation learning agent for multi-task robotic manipulation. Sigma-Agent incorporates contrastive imitation learning (contrastive IL) modules to strengthen vision-language and current-future representations. We also introduce a multi-view querying Transformer (MVQ-Former) that efficiently aggregates representative semantic information. Sigma-Agent shows substantial improvement over state-of-the-art methods under diverse settings in 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% when trained with 10 and 100 demonstrations, respectively. Sigma-Agent also achieves a 62% success rate with a single policy in 5 real-world manipulation tasks. The code will be released upon acceptance.
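
To make the idea of contrastive alignment concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss between two batches of paired features (for example, visual features and language features, or current-state and future-state features). The summary does not give the exact loss used by Sigma-Agent, so the InfoNCE form, the function names, and the temperature value are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(a, b, temperature=0.07):
    # a, b: (batch, dim) arrays; row i of a is the positive pair of row i of b.
    # All other rows in the batch act as negatives.
    # The temperature is a hypothetical value, not one taken from the paper.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # a -> b direction
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

# Usage with placeholder features (random data, not real embeddings).
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))
language = visual + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
loss_aligned = info_nce(visual, language)
loss_mismatched = info_nce(visual, language[::-1])   # pairs deliberately broken
```

With correctly paired features the loss is small, and it grows when the pairing is broken, which is the signal a contrastive module would use to pull matching representations together and push mismatched ones apart.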

Paper Structure (25 sections, 6 equations, 7 figures, 7 tables)


Figures (7)

  • Figure 1: Left: t-SNE (van der Maaten and Hinton, 2008) visualization of multi-task representation learning without and with contrastive IL. Learning with contrastive IL yields a much clearer separation of features belonging to different tasks. Right: Visualization of the regions of interest with Grad-CAM (Selvaraju et al., 2017), which shows accurate object-level understanding.
  • Figure 2: (a) The pipeline of $\mathtt{\Sigma\hbox{-}agent}$. (b) The overview of imitation learning for language-conditioned multi-task manipulation, where representations $\phi, \psi, \delta$ and policy network $\theta$ are learned so that policy $\pi_\theta$ mimics the target policy $\pi^+$. Red line: the contrastive IL modules refine the visual representation $\phi$ (visual encoder) and the joint vision-language representation $\delta$ (MVQ-Former and Feature Fusion). Note that the contrastive IL module is used only during training and does not change the inference process. The visual encoder for future states in the contrastive IL module shares parameters with the visual encoder for current states. The language encoder is kept frozen during training.
  • Figure 3: Ablation experiments. (a) The success rate of $\mathtt{\Sigma\hbox{-}agent}$ ablating language & goal contrastive learning with current observations. (b) The success rate of $\mathtt{\Sigma\hbox{-}agent}$ ablating the batch size of contrastive IL. (c) The success rate of $\mathtt{\Sigma\hbox{-}agent}$ ablating different $\lambda$.
  • Figure 4: Examples of the 18 RLBench tasks (front view) with corresponding human instructions.
  • Figure 5: We visualize the similarity between a key-point trajectory of the close jar task and the language instructions of multiple tasks. Training with the contrastive IL module maximizes the similarity between the visual observations and the related language instructions (darker color) while reducing the similarity with negative instructions (lighter color).
  • ...and 2 more figures
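
The kind of similarity map described for Figure 5 can be sketched as a cosine-similarity matrix between per-keyframe visual features and per-task instruction embeddings. This is only an illustration: the feature dimensions, the random placeholder data, and the function name below are assumptions, and the caption does not state which similarity measure the authors plot.

```python
import numpy as np

def cosine_similarity_matrix(visual, text, eps=1e-8):
    # visual: (num_keyframes, dim), text: (num_instructions, dim).
    # Returns a (num_keyframes, num_instructions) matrix of cosine similarities.
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + eps)
    t = text / (np.linalg.norm(text, axis=1, keepdims=True) + eps)
    return v @ t.T

# Placeholder features standing in for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
keyframes = rng.normal(size=(5, 32))     # 5 key points along one trajectory
instructions = rng.normal(size=(3, 32))  # 3 task instructions
sim = cosine_similarity_matrix(keyframes, instructions)
```

Each row of `sim` corresponds to one key point and each column to one instruction, so rendering it as a heatmap gives a plot of the same shape as the one the caption describes.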