Reinforcement Learning with Generalizable Gaussian Splatting
Jiaxu Wang, Qiang Zhang, Jingkai Sun, Jiahang Cao, Gang Han, Wen Zhao, Weining Zhang, Yecheng Shao, Yijie Guo, Renjing Xu
TL;DR
This work tackles the critical challenge of environment representation in vision-based reinforcement learning by introducing a generalizable 3D Gaussian Splatting framework (GSRL). It learns a pretrained image-conditioned Gaussian encoder that converts multi-view observations into a 3D Gaussian cloud, enabling differentiable, geometry-aware representations without per-scene optimization. The approach combines depth estimation, per-pixel Gaussian property regression, and refinement to yield 3D-consistent scene representations that feed RL policies. Evaluations on RoboMimic show that GSRL improves performance and stability across multiple tasks and offline RL algorithms, highlighting its practical impact for robust, vision-based robotic control.
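As a rough illustration of how per-pixel depth can yield a 3D Gaussian cloud, the following minimal sketch back-projects a depth map through a pinhole camera model and attaches one Gaussian to each valid pixel. It is not the GSRL implementation. The `Gaussian` container, the function names, the intrinsics `fx, fy, cx, cy`, and the constant scale and opacity are all assumptions made for illustration. In the framework described above, the Gaussian properties would be regressed by a learned encoder and then refined, rather than set to fixed defaults.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian:
    """Hypothetical container for one 3D Gaussian (illustrative only)."""
    mean: Vec3      # 3D center in camera coordinates
    scale: Vec3     # per-axis extent; a placeholder constant here
    opacity: float  # placeholder constant here
    color: Vec3     # RGB copied from the source pixel


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) at the given depth with a pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_to_gaussians(depth_map: Sequence[Sequence[float]],
                       colors: Sequence[Sequence[Vec3]],
                       fx: float, fy: float, cx: float, cy: float,
                       default_scale: float = 0.01,
                       default_opacity: float = 1.0) -> List[Gaussian]:
    """Create one Gaussian per pixel with a valid (positive) depth.

    Assumption: scale and opacity are fixed defaults. A learned model would
    predict these per pixel instead.
    """
    gaussians: List[Gaussian] = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d <= 0:  # skip pixels without a valid depth estimate
                continue
            mean = unproject_pixel(u, v, d, fx, fy, cx, cy)
            scale = (default_scale, default_scale, default_scale)
            gaussians.append(Gaussian(mean, scale, default_opacity, colors[v][u]))
    return gaussians
```

Because each Gaussian center comes directly from a pixel and its depth, the resulting cloud is explicit and can be built in a single forward pass, which is the property that removes the need for per-scene optimization.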
Abstract
A good representation is crucial for reinforcement learning (RL) performance, especially in vision-based RL tasks, where the quality of the environment representation directly affects how well the task is learned. Previous vision-based RL typically represents environments explicitly or implicitly, using images, points, voxels, or neural radiance fields. However, these representations have several drawbacks: they cannot describe complex local geometries, do not generalize well to unseen scenes, or require precise foreground masks. Moreover, implicit neural representations are akin to a "black box", significantly hindering interpretability. 3D Gaussian Splatting (3DGS), with its explicit scene representation and differentiable rendering, is considered a revolutionary change for reconstruction and representation methods. In this paper, we propose a novel generalizable Gaussian Splatting framework, called GSRL, as the representation for RL tasks. Validated in the RoboMimic environment, our method achieves better results than other baselines on multiple tasks, improving performance by 10%, 44%, and 15% compared with the baselines on the hardest task. This work is the first attempt to leverage generalizable 3DGS as a representation for RL.
