Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras
Mhairi Dunion, Stefano V. Albrecht
TL;DR
This paper addresses the camera-dependent performance of image-based reinforcement learning by introducing Multi-View Disentanglement (MVD), a self-supervised auxiliary task that learns, from multiple views, a shared camera-agnostic representation and a private camera-specific representation. For each camera $c_i$, encoders produce a shared representation $\mathbf{s}_t^{c_i}$ and a private representation $\mathbf{p}_t^{c_i}$, and the policy input is formed as $\mathbf{z}_t = (\mathbf{s}_t^{c_i}, \mathbf{p}_t^{c_j})$. This enables robust policy learning and zero-shot generalization to any single camera from the training set. The method relies on two InfoNCE-based losses, $\mathcal{L}^{\text{S}}$ and $\mathcal{L}^{\text{P}}$, which align shared representations across views while disentangling camera-specific information. They are combined as $\mathcal{L}^{\text{MVD}} = \mathcal{L}^{\text{S}} + \mathcal{L}^{\text{P}}$ and optimized jointly with the RL objective. Experiments on Panda Gym and MetaWorld tasks show that MVD can achieve performance comparable to multi-camera training while generalizing zero-shot to single-camera deployment, outperforming single-camera baselines and VIB baselines in many tasks. This matters for real-world robotics, where a multi-camera setup may be unavailable or unreliable at deployment and control must fall back to a single camera.
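To make the InfoNCE-based alignment of shared representations concrete, the following is a minimal sketch rather than the authors' implementation. It assumes cosine similarity as the score function, a temperature of 0.1, and other time steps in the batch as negatives; the function names and toy vectors are hypothetical. Only a shared-alignment term in the spirit of $\mathcal{L}^{\text{S}}$ is shown, since this summary does not specify how positives and negatives are paired for $\mathcal{L}^{\text{P}}$.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[k] is pulled towards positives[k]; every other row of
    `positives` acts as a negative for anchors[k].
    """
    total = 0.0
    for k, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_norm - logits[k]
    return total / len(anchors)

# Toy shared representations for three time steps seen from two cameras
# (hypothetical values; real ones would come from the learned encoders).
shared_cam_i = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shared_cam_j = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]

# Same time step paired across cameras: low loss.
aligned = info_nce(shared_cam_i, shared_cam_j)
# Time steps paired in reversed order: higher loss.
misaligned = info_nce(shared_cam_i, shared_cam_j[::-1])
print(aligned < misaligned)
```

Under these assumptions the loss is small when the two cameras agree on the shared representation at each time step and grows when the pairing is broken, which is the behaviour an alignment objective needs.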
Abstract
The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real world, preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that is robust to a reduction in the number of cameras and generalises to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks; however, our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.
