EGSRAL: An Enhanced 3D Gaussian Splatting based Renderer with Automated Labeling for Large-Scale Driving Scene

Yixiong Huo, Guangfeng Jiang, Hongyang Wei, Ji Liu, Song Zhang, Han Liu, Xingliang Huang, Mingjie Lu, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum

TL;DR

EGSRAL presents an enhanced 3D Gaussian Splatting renderer for large-scale driving scenes, integrating a Deformation Enhancement Module (DEM), an Opacity Enhancement Module (OEM), and a Grouping Strategy (GPS) to improve dynamic-object modeling and rendering efficiency. A novel adaptor enables automatic labeling by translating coordinates between coordinate systems and generating corresponding 2D/3D annotations for novel views, guided by a triad of losses and pose augmentation. Empirical results show state-of-the-art novel-view synthesis on KITTI and nuScenes datasets and notable improvements in downstream 2D/3D detection when using synthetic annotations, with ablations confirming the effectiveness of each component and the grouping strategy. The work demonstrates practical impact by reducing annotation dependence while delivering high-quality renderings suitable for autonomous driving perception pipelines. Overall, EGSRAL advances the integration of fast, high-fidelity 3D GS rendering with automated labeling for scalable driving-scene understanding.

Abstract

3D Gaussian Splatting (3D GS) has gained popularity due to its faster rendering speed and high-quality novel view synthesis. Some researchers have explored using 3D GS for reconstructing driving scenes. However, these methods often rely on various data types, such as depth maps, 3D boxes, and trajectories of moving objects. Additionally, the lack of annotations for synthesized images limits their direct application in downstream tasks. To address these issues, we propose EGSRAL, a 3D GS-based method that relies solely on training images without extra annotations. EGSRAL enhances 3D GS's capability to model both dynamic objects and static backgrounds and introduces a novel adaptor for auto labeling, generating corresponding annotations based on existing annotations. We also propose a grouping strategy for vanilla 3D GS to address perspective issues in rendering large-scale, complex scenes. Our method achieves state-of-the-art performance on multiple datasets without any extra annotation. For example, the PSNR metric reaches 29.04 on the nuScenes dataset. Moreover, our automated labeling can significantly improve the performance of 2D/3D detection tasks. Code is available at https://github.com/jiangxb98/EGSRAL.
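The abstract quotes PSNR as its headline image-quality number. As a reminder of what that metric measures, here is a minimal sketch of the standard PSNR definition (in decibels), assuming both images are arrays of the same shape with pixel values in [0, 1]. The function name `psnr` and the `max_value` parameter are illustrative choices and are not taken from the EGSRAL code base; the paper's exact evaluation protocol (resolution, color space, averaging over frames) is not stated here.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images.

    Assumes pixel values lie in [0, max_value]; this is the textbook
    definition, not necessarily the exact evaluation script of the paper.
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if rendered.shape != reference.shape:
        raise ValueError("images must have the same shape")

    # Mean squared error over all pixels and channels.
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images

    return float(10.0 * np.log10(max_value ** 2 / mse))
```

A higher value means the rendered view is closer to the held-out ground-truth image; a uniform per-pixel error of 0.1 on a [0, 1] scale corresponds to 20 dB.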

Paper Structure

This paper contains 29 sections, 10 equations, 10 figures, 17 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of EGSRAL. The framework first aligns the input images, then initializes the 3D Gaussians from the point cloud generated by SfM. A deformable network (orange block) constructs the 3D Gaussian deformation field, while the deformation enhancement module (DEM) (yellow blocks) refines this field. The opacity enhancement module (OEM) (blue blocks) optimizes opacity. To address perspective issues in large-scale, complex scenes, a group-based training and rendering strategy (green block) is employed (Section 3.2). Additionally, the adaptor is trained using three constraints (orange) to enhance its coordinate relationship modeling. During inference, these modules render and synthesize novel view images with corresponding annotations (Section 3.3).
  • Figure 2: Illustration of the grouping strategy.
  • Figure 3: Illustration of the adaptor module, which comprises a model module and a transformation module. The camera pose $P_{n_{nov}\_S}$ in the SfM coordinate system corresponds to the novel camera pose $P_{n_{nov}}$.
  • Figure 4: Qualitative comparison of novel view synthesis on the nuScenes dataset.
  • Figure 5: Visualizing 2D/3D auto labeling on nuScenes.
  • ...and 5 more figures
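
The auto-labeling idea behind Figures 3 and 5 relies on expressing existing 3D annotations in the frame of a novel camera and then deriving a 2D annotation from them. The sketch below shows only the generic geometric step, under loud assumptions: the novel camera pose is already known as a 4x4 rigid transform in the same frame as the 3D boxes, the camera is an ideal pinhole with intrinsics `K`, and the camera looks along +z. It does not reproduce the paper's learned adaptor, which models the relationship between the SfM and annotation coordinate systems. All names (`box_corners`, `project_box`, `T_cam_from_world`) are hypothetical.

```python
import numpy as np


def box_corners(center, size, yaw):
    """Return the 8 corners (8x3) of a 3D box rotated by `yaw` about the z-axis.

    `size` is (length, width, height) along the box's local x, y, z axes.
    """
    length, width, height = size
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * length / 2.0
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * width / 2.0
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * height / 2.0
    corners = np.stack([x, y, z], axis=1)

    c, s = np.cos(yaw), np.sin(yaw)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return corners @ rot_z.T + np.asarray(center, dtype=np.float64)


def project_box(corners_world, T_cam_from_world, K):
    """Map box corners into a camera frame and derive an axis-aligned 2D box.

    Returns (corners_cam, bbox) where bbox is (x_min, y_min, x_max, y_max)
    in pixels, or None if any corner lies behind the camera (z <= 0).
    """
    homo = np.hstack([corners_world, np.ones((corners_world.shape[0], 1))])
    corners_cam = (T_cam_from_world @ homo.T).T[:, :3]  # 3D label in camera frame

    if np.any(corners_cam[:, 2] <= 0.0):
        return corners_cam, None  # assumption: skip boxes not fully in front

    uvw = (K @ corners_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]  # perspective division
    x_min, y_min = uv.min(axis=0)
    x_max, y_max = uv.max(axis=0)
    return corners_cam, (float(x_min), float(y_min), float(x_max), float(y_max))
```

In a real pipeline the 2D box would also need clipping to the image bounds and an occlusion check; both are omitted here to keep the coordinate-change step visible.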