Multimodal Information Bottleneck for Deep Reinforcement Learning with Multiple Sensors

Bang You, Huaping Liu

TL;DR

The authors argue that compressing the information that learned joint representations retain about raw multimodal observations is helpful. They propose a multimodal information bottleneck model that learns task-relevant joint representations from egocentric images and proprioception by minimizing an upper bound of the bottleneck objective.

Abstract

Reinforcement learning has achieved promising results on robotic control tasks but struggles to leverage information effectively from multiple sensory modalities that differ in many characteristics. Recent works construct auxiliary losses based on reconstruction or mutual information to extract joint representations from multiple sensory inputs, improving the sample efficiency and performance of reinforcement learning algorithms. However, the representations learned by these methods can capture information irrelevant to learning a policy, which may degrade performance. We argue that compressing information in the learned joint representations about raw multimodal observations is helpful, and propose a multimodal information bottleneck model to learn task-relevant joint representations from egocentric images and proprioception. Our model compresses multimodal observations while retaining their predictive information, yielding a compressed joint representation that fuses complementary information from visual and proprioceptive feedback while filtering out task-irrelevant information in the raw observations. We propose to minimize an upper bound of our multimodal information bottleneck objective for computationally tractable optimization. Experimental evaluations on several challenging locomotion tasks with egocentric images and proprioception show that our method achieves better sample efficiency and zero-shot robustness to unseen white noise than leading baselines. We also empirically demonstrate that leveraging information from both egocentric images and proprioception is more helpful for learning policies on locomotion tasks than using a single modality alone.

Paper Structure

This paper contains 22 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of the objective of our MIB model. Our MIB objective minimizes the mutual information between the joint representation $z_t$ and the embeddings of the current image and proprioception for compression, while maximizing the predictive information $I(z_t, a_t; z_{t+1})$ for improving the latent temporal consistency.
  • Figure 2: The network architecture of our MIB model. We use an image encoder and a proprioception encoder to extract latent representations from the current egocentric image and proprioception, respectively. The obtained upper bound of the mutual information $I(z_t;c_t^p, c_t^i)$ is the KL divergence between the distribution of the joint representations $z_t$ given the extracted representations and the standard Gaussian distribution. The prediction head and the projection head are used to map the joint representations into a latent space, where the lower bound of the mutual information $I(z_t, a_t; z_{t+1})$ is computed.
  • Figure 3: Continuous locomotion tasks are used in our experiments, namely Hurdle Walker Walk/Run, Hurdle Cheetah Run, and Ant Empty. Third-person views for visualization and egocentric images provided to the agent are shown in the top and bottom rows, respectively.
  • Figure 4: We compare the performance of our method to baselines on four challenging locomotion tasks. The plot shows the average reward and 95% confidence interval across 5 independent runs. For each run, the mean return is computed over 10 trajectories. Our method achieves better sample efficiency and performance than all baselines across all tasks.
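
The two terms described in the captions of Figures 1 and 2 can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the joint representation $z_t$ is modeled as a diagonal Gaussian (so the KL divergence to the standard Gaussian has a closed form), and it uses an InfoNCE-style contrastive estimator as one common lower bound on mutual information, since the captions do not name the estimator. The function names, the cosine-similarity critic, and the weight `beta` are all hypothetical choices made for illustration.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per batch row.

    Assumption: z_t is a diagonal Gaussian parameterized by (mu, log_var).
    This plays the role of the upper bound on I(z_t; c_t^p, c_t^i).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def info_nce_lower_bound(pred, target):
    """InfoNCE-style lower bound on the mutual information between paired rows.

    pred[i] stands for the prediction-head output computed from (z_t, a_t);
    target[i] stands for the projection-head output computed from z_{t+1}.
    Row i of each array forms a positive pair; other rows act as negatives.
    Assumption: a cosine-similarity critic without a temperature.
    """
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    logits = pred @ target.T  # shape (B, B)
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The bound is log(B) plus the mean log-probability of the positives.
    return np.mean(np.diag(log_probs)) + np.log(len(pred))


def mib_loss(mu, log_var, pred, target, beta=1e-3):
    """Hypothetical combined loss: beta * compression term - prediction term."""
    compression = np.mean(kl_to_standard_normal(mu, log_var))
    prediction = info_nce_lower_bound(pred, target)
    return beta * compression - prediction
```

Minimizing `mib_loss` pushes the KL term down (compression) and the contrastive bound up (latent temporal consistency). Note that the InfoNCE bound can never exceed the log of the batch size, so larger batches allow a tighter estimate.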