Combining Reconstruction and Contrastive Methods for Multimodal Representations in RL

Philipp Becker; Sebastian Mossburger; Fabian Otto; Gerhard Neumann

TL;DR

CoRAL addresses the challenge of learning robust multimodal state representations for reinforcement learning by combining reconstruction-based and contrastive losses across sensor modalities within a recurrent state-space model. It introduces two instantiations, Variational CoRAL and Predictive CoRAL, that swap reconstruction terms for mutual-information terms using the InfoNCE bound, enabling modality-specific loss selection (e.g., reconstruction for proprioception and contrastive for images). The framework is validated on diverse suites with distractions and occlusions (Video Backgrounds, Occlusions), a Locomotion suite, and a ManiSkill2-based Manipulation suite, showing significant improvements over single-loss or naive fusion baselines, especially for model-based RL under challenging visual conditions. Overall, CoRAL demonstrates that careful modality-aware loss design in state-space representations can markedly improve both sample efficiency and task performance in multimodal RL, with practical implications for sensor fusion in real-world robotics and vision-based control.

Abstract

Learning self-supervised representations using reconstruction or contrastive losses improves performance and sample complexity of image-based and multimodal reinforcement learning (RL). Here, different self-supervised loss functions have distinct advantages and limitations depending on the information density of the underlying sensor modality. Reconstruction provides strong learning signals but is susceptible to distractions and spurious information. While contrastive approaches can ignore those, they may fail to capture all relevant details and can lead to representation collapse. For multimodal RL, this suggests that different modalities should be treated differently based on the amount of distractions in the signal. We propose Contrastive Reconstructive Aggregated representation Learning (CoRAL), a unified framework enabling us to choose the most appropriate self-supervised loss for each sensor modality and allowing the representation to better focus on relevant aspects. We evaluate CoRAL's benefits on a wide range of tasks with images containing distractions or occlusions, a new locomotion suite, and a challenging manipulation suite with visually realistic distractions. Our results show that learning a multimodal representation by combining contrastive and reconstruction-based losses can significantly improve performance and solve tasks that are out of reach for more naive representation learning approaches and other recent baselines.
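The idea of choosing a loss per modality can be illustrated with a small sketch: a contrastive InfoNCE term for a distraction-prone modality (images) and a likelihood-based reconstruction term for a clean one (proprioception). This is not the authors' implementation. The function names, the dot-product score, the `temperature` parameter, and the unit-variance Gaussian likelihood are assumptions made for illustration, and the sketch leaves out the recurrent state-space model and any further terms of the full objective.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce_loss(obs_embeddings, latent_embeddings, temperature=1.0):
    """Mean InfoNCE loss over a batch.

    Pair (i, i) is the positive; every (i, j) with j != i is a negative.
    For a batch of size N, log(N) minus this loss lower-bounds the mutual
    information between observations and latents.
    The dot-product score is an assumption of this sketch.
    """
    n = len(obs_embeddings)
    total = 0.0
    for i in range(n):
        scores = [
            dot(obs_embeddings[i], latent_embeddings[j]) / temperature
            for j in range(n)
        ]
        # Cross-entropy of picking the matching latent among all candidates.
        total += log_sum_exp(scores) - scores[i]
    return total / n


def gaussian_reconstruction_loss(target, prediction):
    """Negative log-likelihood of a unit-variance Gaussian, up to a constant."""
    return 0.5 * sum((t - p) ** 2 for t, p in zip(target, prediction))


def combined_loss(image_emb, latent_emb, proprio_targets, proprio_preds):
    """Contrastive term for images plus reconstruction term for proprioception."""
    contrastive = info_nce_loss(image_emb, latent_emb)
    reconstruction = sum(
        gaussian_reconstruction_loss(t, p)
        for t, p in zip(proprio_targets, proprio_preds)
    ) / len(proprio_targets)
    return contrastive + reconstruction
```

With uninformative embeddings the contrastive term equals the log of the batch size, and it approaches zero as matching pairs score much higher than mismatched ones, which is why minimizing it tightens the mutual-information bound.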

Paper Structure (26 sections, 8 equations, 22 figures, 4 tables)

This paper contains 26 sections, 8 equations, 22 figures, 4 tables.

Introduction
Related Work
Combining Contrastive Approaches and Reconstruction for State Space Representations
Learning the State Space Representation
Learning to Act Based on the Representation
Experiments
Modified DeepMind Control Suite Tasks
Locomotion Suite
Manipulation Suite
Discussion
Conclusion
Environments
DeepMind Control Suite Tasks
Natural Background.
Occlusions.
...and 11 more sections

Figures (22)

Figure 1: Contrastive Reconstructive Aggregated representation Learning (CoRAL) learns multimodal state space representations of all available sensors using a combination of reconstruction-based and contrastive objectives. Building on the insight that likelihood-based reconstruction can be exchanged for contrastive approaches based on mutual information, CoRAL allows us to choose an appropriate loss function for each modality. Motivated by both a variational and a predictive coding viewpoint, CoRAL helps model-free and model-based agents excel in challenging tasks that require fusing information from sensors with different properties, such as images and proprioception.
Figure 2: Aggregated performance after $10^6$ environment steps on the $7$ tasks from the Video Background suite (IQM and $95 \%$ CIs). For both model-free and model-based RL, V-CoRAL performs best among all considered methods, with the model-free performance being better than the model-based one. While some of the model-free ablations are competitive, they perform considerably worse in the model-based case. Among the baselines, only DrQ-v2 with additional proprioception, RePo (with and without proprioception), and DreamerPro reach a final return of over $200$. These results demonstrate how including readily available proprioception with appropriate losses for each modality helps to learn the accurate dynamics required by model-based RL and provides a simple alternative to more tailored approaches.
Figure 3: Aggregated performance after $10^6$ environment steps on the $7$ tasks from the Occlusion suite (IQM and $95 \%$ CIs). For both model-free and model-based RL, P-CoRAL performs best among all considered methods, with the model-free version again outperforming its model-based counterpart. While all approaches handle Occlusions worse than Video Background, the performance drop is generally larger for the ablations and baselines. In particular, the Concat and model-based Same-Loss ablations suffer, and no approach using only a single modality achieves an expected return of over $200$. This indicates the importance of learning a multimodal representation using tailored losses over naively integrating proprioception.
Figure 4: Left: Saliency Maps showing on which pixels the respective representation learning approaches focus in an example from Video Prediction. V-CoRAL focuses better on the task-relevant cheetah, while the corresponding contrastive variational Img-Only approach is more distracted by the video background. Right: For this Occlusion task, we train a separate decoder to reconstruct the occlusion-free ground truth from the (detached) latent representation. For Cartpole Swingup only the cart position is part of the proprioception. Still, P-CoRAL can capture both cart position and pole angle, while the contrastive predictive Img-Only approach fails to do so.
Figure 5: Left: Example egocentric (upper row) and external (lower row) images for the Hurdle Cheetah Run, Hurdle Walker Run, Ants Walls, and Quadruped Escape tasks of the Locomotion suite. Only the egocentric images are given to the agents, while the external images serve solely to visualize the tasks. Right: Aggregated performance of model-free agents and RePo after $10^6$ environment steps on the $6$ tasks of the Locomotion suite (IQM and $95 \%$ CIs). P-CoRAL significantly outperforms all ablative variants and baselines, highlighting how combining contrastive methods and reconstruction can form effective multimodal representations. It also outperforms purely reconstruction-based approaches, even with no distraction in the images.
...and 17 more figures
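
The captions above report aggregated performance as IQM (interquartile mean) with $95 \%$ CIs. As a rough illustration of what these quantities are, the sketch below computes the IQM of a flat list of scores and a simple percentile-bootstrap interval around it. The function names, the plain (non-stratified) resampling, and the default parameters are assumptions of this sketch and not necessarily the evaluation protocol used in the paper.

```python
import random


def interquartile_mean(scores):
    """Mean of the middle 50% of the scores (lowest and highest 25% dropped)."""
    ordered = sorted(scores)
    n = len(ordered)
    cut = n // 4  # number of scores trimmed from each end
    middle = ordered[cut:n - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the IQM.

    Resamples the scores with replacement; this simple scheme is an
    assumption of the sketch, chosen for brevity.
    """
    rng = random.Random(seed)
    n = len(scores)
    stats = sorted(
        interquartile_mean([rng.choice(scores) for _ in range(n)])
        for _ in range(num_resamples)
    )
    lower = stats[int((alpha / 2) * num_resamples)]
    upper = stats[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return lower, upper
```

Because the IQM discards the extreme quarters of the scores, a single outlier run has little influence on it, which makes it more robust than a plain mean over few seeds.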
