Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras

Mhairi Dunion, Stefano V. Albrecht

TL;DR

This paper tackles the challenge of camera-dependent performance in image-based reinforcement learning by introducing Multi-View Disentanglement (MVD), a self-supervised auxiliary task that learns a shared camera-agnostic representation and a private camera-specific representation from multiple views. By deploying encoders that produce $\mathbf{s}_t^{c_i}$ and $\mathbf{p}_t^{c_i}$ and forming policy inputs $\mathbf{z}_t = (\mathbf{s}_t^{c_i}, \mathbf{p}_t^{c_j})$, MVD enables robust policy learning and zero-shot generalization to any single camera from the training set. The method relies on two InfoNCE-based losses, $\mathcal{L}^{\text{S}}$ and $\mathcal{L}^{\text{P}}$, to align shared representations across views while disentangling camera-specific information, culminating in $\mathcal{L}^{\text{MVD}} = \mathcal{L}^{\text{S}} + \mathcal{L}^{\text{P}}$ and joint RL optimization. Experimental results on Panda Gym and MetaWorld tasks demonstrate that MVD can achieve performance comparable to multi-camera training while offering robust zero-shot generalization to single-camera deployment, outperforming single-camera baselines and VIB baselines in many tasks. This approach is practically significant for real-world robotics where multi-camera setups may be unavailable or unreliable during deployment, enabling reliable control from a single camera when needed.
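The two auxiliary losses described above can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation. It assumes cosine similarity with a temperature, treats the private representations as the negatives for the shared loss, and, for the private loss, assumes the positive is the same camera's private representation at another timestep. All function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE: negative log-softmax score of the positive candidate."""
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


def shared_loss(shared, private, temperature=0.1):
    """Pull shared representations of different cameras together and push
    them away from the private representations (assumed choice of negatives).

    shared, private: one vector per camera, all from the same timestep.
    """
    total, count = 0.0, 0
    for i, anchor in enumerate(shared):
        for j, positive in enumerate(shared):
            if i == j:
                continue
            total += info_nce(anchor, positive, private, temperature)
            count += 1
    return total / count


def private_loss(private, private_other_step, temperature=0.1):
    """Push private representations of different cameras apart.

    Assumption: the positive for camera i is its own private representation
    at another timestep; negatives are the other cameras' private vectors.
    """
    total = 0.0
    for i, anchor in enumerate(private):
        negatives = [p for j, p in enumerate(private) if j != i]
        total += info_nce(anchor, private_other_step[i], negatives, temperature)
    return total / len(private)


# Toy example with two cameras and 2-D representations (made-up values).
shared = [[1.0, 0.0], [0.9, 0.1]]
private = [[0.0, 1.0], [0.0, -1.0]]
private_other_step = [[0.1, 1.0], [0.1, -1.0]]

# L^MVD = L^S + L^P, as stated in the summary above.
mvd_loss = shared_loss(shared, private) + private_loss(private, private_other_step)
```

In a real agent these vectors would be batched encoder outputs, and `mvd_loss` would be added to the RL objective. The nested loops here are only for readability.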

Abstract

The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real world, preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that is robust to a reduction in the number of cameras and generalises to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks; however, our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.

Paper Structure (35 sections, 8 equations, 10 figures, 2 tables)


Figures (10)

  • Figure 1: First-person and third-person camera views for MetaWorld Soccer task.
  • Figure 2: Multi-view Disentanglement (MVD) architecture. Each camera image is used to generate a shared and private representation. The shared auxiliary loss $\mathcal{L}^{\text{S}}$ uses these representations to maximise similarity between shared representations and minimise similarity between shared and private representations. The private auxiliary loss $\mathcal{L}^{\text{P}}$ minimises similarity between private representations.
  • Figure 3: Camera views used for Panda tasks.
  • Figure 4: Results for Panda tasks showing success rate for evaluation on all cameras (left of dashed line) compared with success rate on each of the individual cameras (right). Success rate is averaged over 20 evaluation episodes for 5 seeds. The shaded region is standard deviation.
  • Figure 5: Results for MetaWorld tasks showing success rate for evaluation on all cameras (left of dashed line) compared with success rate on each of the individual cameras (right). Success rate is averaged over 20 evaluation episodes for 5 seeds. The shaded region is standard deviation.
  • ...and 5 more figures