M2CURL: Sample-Efficient Multimodal Reinforcement Learning via Self-Supervised Representation Learning for Robotic Manipulation

Fotios Lygerakis, Vedant Dave, Elmar Rueckert

TL;DR

M2CURL addresses sample-inefficient reinforcement learning in visuotactile robotic manipulation by learning robust multimodal representations through self-supervised contrastive learning. It introduces a modular architecture with online and momentum encoders and four modality-specific heads, and it folds intra- and inter-modal InfoNCE losses into the RL objective through a combined multimodal loss $\mathcal{L}_{MM}$. The approach is algorithm-agnostic and is validated on three tasks in Tactile Gym 2, where it converges faster and reaches higher rewards than SAC and PPO baselines, including RAD variants and state-based comparisons. By leveraging visuotactile augmentations and a balanced multimodal loss, M2CURL offers a practical route to more data-efficient manipulation policies, with potential extension to physical robots and additional sensing modalities.
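To make the loss structure concrete, the following is a minimal sketch of how intra- and inter-modal InfoNCE terms could be combined into a single weighted loss. It is not the authors' implementation: the cosine similarity, the temperature of 0.1, the function names, and the two weights `w_intra` and `w_inter` are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vector):
    """Project a code onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    keys[i] is the positive for queries[i]; every other key in the
    batch acts as a negative. Similarity is the cosine similarity
    divided by a temperature (an assumption, not taken from the paper).
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        # Numerically stable log-sum-exp over all keys.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching key as the target class.
        total += log_sum_exp - logits[i]
    return total / len(q)


def multimodal_loss(vis_q, vis_k, tac_q, tac_k, w_intra=1.0, w_inter=1.0):
    """Weighted sum of intra- and inter-modal InfoNCE terms (hypothetical weights)."""
    intra = info_nce(vis_q, vis_k) + info_nce(tac_q, tac_k)
    inter = info_nce(vis_q, tac_k) + info_nce(tac_q, vis_k)
    return w_intra * intra + w_inter * inter


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    # Matching query/key pairs give a much lower loss than mismatched pairs.
    print(info_nce(aligned, aligned) < info_nce(aligned, swapped))  # True
```

In this sketch, the query codes would come from the online encoder and the key codes from the momentum encoder; setting `w_inter=0.0` reduces it to two independent single-modality contrastive losses.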

Abstract

One of the most critical aspects of multimodal Reinforcement Learning (RL) is the effective integration of different observation modalities. Having robust and accurate representations derived from these modalities is key to enhancing the robustness and sample efficiency of RL algorithms. However, learning representations in RL settings for visuotactile data poses significant challenges, particularly due to the high dimensionality of the data and the complexity involved in correlating visual and tactile inputs with the dynamic environment and task objectives. To address these challenges, we propose Multimodal Contrastive Unsupervised Reinforcement Learning (M2CURL). Our approach employs a novel multimodal self-supervised learning technique that learns efficient representations and contributes to faster convergence of RL algorithms. Our method is agnostic to the RL algorithm, thus enabling its integration with any available RL algorithm. We evaluate M2CURL on the Tactile Gym 2 simulator and we show that it significantly enhances the learning efficiency in different manipulation tasks. This is evidenced by faster convergence rates and higher cumulative rewards per episode, compared to standard RL algorithms without our representation learning approach.

Paper Structure

This paper contains 20 sections, 10 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Representation of M2CURL features on a unit sphere of codes. This diagram illustrates the projection of visual and tactile observations from high-dimensional space onto a unit sphere, where both intra- and inter-modality losses are computed. These losses form the core of the contrastive multimodal loss that drives M2CURL's learning process.
  • Figure 2: The M2CURL Architecture: First, a batch of visuotactile observations is sampled from the replay buffer. Then, two random augmentations are applied, one for the query (online) encoders and one for the key (momentum) encoders, and their representations are computed. The query and key representations are used to compute the inter- and intra-modality codes using the respective heads, from which the different inter- and intra-modality losses are computed. Finally, the weighted sum of the sub-losses is passed to the RL algorithm as a combined multimodal contrastive loss $\mathcal{L}_{MM}$. Momentum encoders are denoted with *.
  • Figure 3: Performance comparison of SAC and PPO algorithms using visual or tactile observations, against M2CURL using visuotactile observations.
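
Two steps from the Figure 2 pipeline can be sketched in a few lines: generating two augmented views of one observation, and keeping the momentum (key) encoder as a slowly moving copy of the online (query) encoder. The sketch below is illustrative only. Random cropping and an exponential-moving-average update are common choices in contrastive RL, but the crop size, the rate `tau`, and the function names here are assumptions, not details confirmed by the text above.

```python
import random


def random_crop(image, size, rng):
    """Crop a size x size window at a random location from a 2D list-of-lists image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def momentum_update(online_params, momentum_params, tau=0.01):
    """Exponential moving average: theta_k <- (1 - tau) * theta_k + tau * theta_q."""
    return [(1.0 - tau) * k + tau * q
            for q, k in zip(online_params, momentum_params)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 8x8 "observation"; real inputs would be camera or tactile images.
    image = [[r * 8 + c for c in range(8)] for r in range(8)]
    query_view = random_crop(image, 6, rng)  # fed to the online encoder
    key_view = random_crop(image, 6, rng)    # fed to the momentum encoder
    print(len(query_view), len(query_view[0]))  # 6 6

    # Flattened toy parameters standing in for encoder weights.
    online = [1.0, -2.0]
    momentum = [0.0, 0.0]
    for _ in range(3):
        momentum = momentum_update(online, momentum, tau=0.5)
    print(momentum)  # [0.875, -1.75]
```

Only the online encoder would receive gradients in such a setup; the momentum copy changes solely through the averaging step, which keeps the keys stable between updates.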