Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki

TL;DR

MSDP tackles multisensory reinforcement learning for contact-rich manipulation by learning expressive latent representations through masked autoencoding across vision, proprioception, and force-torque signals. It decouples representation learning from policy optimization and introduces an asymmetric actor-critic architecture: a cross-attention-based critic extracts dynamic, task-specific features from the frozen embeddings, while the actor relies on a stable pooled representation. Empirical results in simulation and on a real robot show faster learning, robustness to sensor noise and changing object dynamics, and high real-robot success rates with as few as 6,000 online interactions. The approach scales to more modalities and can pretrain with sensors not present during policy execution, offering a practical, data-efficient solution for complex multisensory robotic control.

Abstract

Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.
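
The asymmetric readout described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the token count, embedding width, mean pooling, single attention head, single learned query, and all function names (`mask_sensor_tokens`, `actor_features`, `critic_features`) are assumptions made for illustration, and random arrays stand in for the frozen encoder outputs.

```python
import numpy as np


def mask_sensor_tokens(tokens, keep, rng):
    """Pretraining-style masking: keep a random subset of `keep` sensor tokens."""
    idx = np.sort(rng.choice(tokens.shape[0], size=keep, replace=False))
    return tokens[idx], idx


def actor_features(latents):
    """Actor path: pool the frozen latents into one fixed-size vector.

    Mean pooling is an assumption; the source only says "pooled".
    """
    return latents.mean(axis=0)


def critic_features(latents, query, w_k, w_v):
    """Critic path: single-head cross-attention with one learned query (assumed)."""
    keys = latents @ w_k                               # (n_tokens, d)
    values = latents @ w_v                             # (n_tokens, d)
    scores = keys @ query / np.sqrt(query.shape[0])    # (n_tokens,)
    weights = np.exp(scores - scores.max())            # numerically stable softmax
    weights /= weights.sum()
    return weights @ values, weights


rng = np.random.default_rng(0)
n_tokens, d = 3, 8  # hypothetical: one token each for vision, proprioception, force-torque

# Stand-in for sensor embeddings; a real pipeline would produce these from observations.
embeddings = rng.normal(size=(n_tokens, d))

# During pretraining only a subset of embeddings is visible to the encoder.
visible, kept_idx = mask_sensor_tokens(embeddings, keep=2, rng=rng)

# Downstream RL uses all (unmasked) frozen latents; here the embeddings stand in for them.
latents = embeddings
query = rng.normal(size=d)
w_k = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))

actor_in = actor_features(latents)
critic_in, attn = critic_features(latents, query, w_k, w_v)
```

In this sketch the actor input depends on the latents only through a fixed pooling operation, whereas the critic input changes with the trainable `query`, `w_k`, and `w_v`, which is one way to read the "stable" versus "dynamic, task-specific" distinction drawn in the abstract.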

Paper Structure

This paper contains 17 sections, 10 figures.

Figures (10)

  • Figure 1: Multisensory Dynamic Pretraining fuses multiple sensors, like human senses, to solve complex contact-rich manipulation tasks.
  • Figure 2: The MSDP framework with MSDP-Encoder (left), Pretraining (top right) and downstream RL (bottom right): The current multisensory observation is projected into the embedding space with a CNN stem and linear layers. The MSDP-encoder fuses all sensor embeddings to form our expressive multisensory latent representation. The encoder is trained together with the decoder by reconstructing the (next) sensor observation from a subset of sensor embeddings. This pretraining results in dynamic cross-sensor prediction, shaping and fusing sensor representations. For downstream RL we extract multisensory task-specific features via a single cross-attention layer for the critic and via pooling for the actor. Sensor embeddings are only masked during pretraining. Our framework offers an expressive and robust multisensory representation for complex contact-rich manipulation tasks in simulation and the real world.
  • Figure 3: Multisensory contact-rich robot environments
  • Figure 4: Performance comparison of MSDP-P and MSDP-R against the baselines in Peg Insertion, Push Cube, Close Drawer Gently and Dual Arm Peg Insertion. Our method significantly accelerates RL training and achieves the highest final success rate across all tasks.
  • Figure 5: Peg Insertion Sensor Ablation: Proprioception is crucial to identify the precise peg pose under vision noise, while the force-torque sensor allows for precise exploration around the hole, resulting in consistent insertions. Vision alone does not achieve a high success rate.
  • ...and 5 more figures