Reinforcement Learning with Generalizable Gaussian Splatting

Jiaxu Wang, Qiang Zhang, Jingkai Sun, Jiahang Cao, Gang Han, Wen Zhao, Weining Zhang, Yecheng Shao, Yijie Guo, Renjing Xu

TL;DR

This work tackles the critical challenge of environment representation in vision-based reinforcement learning by introducing a generalizable 3D Gaussian Splatting framework (GSRL). It learns a pretrained image-conditioned Gaussian encoder that converts multi-view observations into a 3D Gaussian cloud, enabling differentiable, geometry-aware representations without per-scene optimization. The approach combines depth estimation, per-pixel Gaussian property regression, and refinement to yield 3D-consistent scene representations that feed RL policies. Evaluations on RoboMimic show that GSRL improves performance and stability across multiple tasks and offline RL algorithms, highlighting its practical impact for robust, vision-based robotic control.
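As a rough illustration of the geometric step implied above (turning a per-pixel depth estimate into 3D positions that can serve as Gaussian centers), the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera; the function names `unproject_depth` and `merge_views`, the intrinsics matrix `K`, and the `cam_to_world` pose are hypothetical names introduced here. The remaining Gaussian properties, which the summary says are regressed per pixel, would come from a learned network and are not shown.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to (H*W, 3) world-space points.

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4
    camera-to-world pose (both hypothetical inputs for this sketch).
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)
    # Back-project each pixel: X_cam = depth * K^{-1} [u, v, 1]^T.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Move the points from the camera frame to the world frame.
    ones = np.ones((pts_cam.shape[0], 1))
    pts_h = np.concatenate([pts_cam, ones], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]


def merge_views(depths, Ks, poses):
    """Concatenate the unprojected points of several views into one cloud."""
    clouds = [unproject_depth(d, K, T) for d, K, T in zip(depths, Ks, poses)]
    return np.concatenate(clouds, axis=0)
```

With one candidate center per pixel, a multi-view observation yields a point set whose size is the total pixel count across views; an encoder would then attach the other Gaussian properties to each point.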

Abstract

A good representation is crucial for reinforcement learning (RL) performance, especially in vision-based RL tasks, where the quality of the environment representation directly influences how well the task is learned. Previous vision-based RL typically represents environments explicitly or implicitly, using images, points, voxels, or neural radiance fields. However, these representations have several drawbacks: they cannot describe complex local geometries, do not generalize well to unseen scenes, or require precise foreground masks. Moreover, implicit neural representations are akin to a "black box", which significantly hinders interpretability. 3D Gaussian Splatting (3DGS), with its explicit scene representation and differentiable rendering, is considered a revolutionary change for reconstruction and representation methods. In this paper, we propose a novel generalizable Gaussian Splatting framework, called GSRL, to serve as the representation for RL tasks. Validated in the RoboMimic environment, our method achieves better results than other baselines on multiple tasks, improving performance by 10%, 44%, and 15% over the baselines on the hardest task. This work is the first attempt to leverage generalizable 3DGS as a representation for RL.

Paper Structure (14 sections, 12 equations, 2 figures, 4 tables)

This paper contains 14 sections, 12 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the main pipeline. The contents within the blue dashed line show the training of the generalizable Gaussian prediction module, which converts image observations into a 3D-consistent, geometry-aware 3D Gaussian cloud. The contents within the orange dashed line show the RL training module, which takes the reconstructed 3D Gaussians as input and predicts the target actions.
  • Figure 2: We evaluate our method on four tasks. (A) Transport: two robot arms must collaboratively place the red box into the target container and transport the hammer to the opposite side. (B) Lift: the arm must lift the red box. (C) Can: the robot arm must place the can in the specific area marked with the can symbol. (D) Square: the arm must place the hollow square object with a handle onto the square pillar.