Hybrid Rendering for Multimodal Autonomous Driving: Merging Neural and Physics-Based Simulation
Máté Tóth, Péter Kovács, Réka Bencses, Zoltán Bendefy, Zoltán Hortsin, Balázs Teréki, Tamás Matuszka
TL;DR
This work introduces a hybrid rendering framework for autonomous driving simulation that combines neural reconstruction of static environments with traditional mesh-based dynamic agents, giving both controllability and multimodal sensor fidelity. The core contribution, NeRF2GS, trains a depth-regularized NeRF and uses it as a teacher for 3D Gaussian Splatting (3DGS), delivering high-quality novel-view synthesis at interactive frame rates while allowing dynamic object insertion and changes to environmental conditions. A block-based training scheme supports large-scale reconstructions, a rasterization backend provides real-time camera simulation, and a ray-traced backend provides precise LiDAR simulation. Evaluations on Waymo data show competitive static reconstruction quality, robust novel-view synthesis for road features, and favorable downstream task performance with a limited domain gap, highlighting the method’s potential for scalable, sensor-accurate autonomous driving simulation.
Abstract
Neural reconstruction models for autonomous driving simulation have made significant strides in recent years, with dynamic models becoming increasingly prevalent. However, these models are typically limited to handling in-domain objects that closely follow their original trajectories. We introduce a hybrid approach that combines the strengths of neural reconstruction with physics-based rendering. This method enables the virtual placement of traditional mesh-based dynamic agents at arbitrary locations, adjustments to environmental conditions, and rendering from novel camera viewpoints. Our approach significantly enhances novel view synthesis quality, especially for road surfaces and lane markings, while maintaining interactive frame rates through our novel training method, NeRF2GS. This technique leverages the superior generalization capabilities of NeRF-based methods and the real-time rendering speed of 3D Gaussian Splatting (3DGS). We achieve this by training a customized NeRF model on the original images with depth regularization derived from a noisy LiDAR point cloud, then using it as a teacher model for 3DGS training. The teacher provides accurate depth, surface normals, and camera appearance modeling as supervision. With our block-based training parallelization, the method can handle large-scale reconstructions (at least 100,000 square meters) and predict segmentation masks, surface normals, and depth maps. During simulation, it supports a rasterization-based rendering backend with depth-based composition and multiple camera models for real-time camera simulation, as well as a ray-traced backend for precise LiDAR simulation.
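To make the depth-based composition step concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes that the neural reconstruction and the mesh renderer each produce a color image and a per-pixel depth map for the same camera, and that pixels not covered by any mesh carry infinite depth. The function name `composite_by_depth` and the nested-list image layout are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]
DepthMap = List[List[float]]


def composite_by_depth(
    scene_rgb: Image,
    scene_depth: DepthMap,
    mesh_rgb: Image,
    mesh_depth: DepthMap,
) -> Tuple[Image, DepthMap]:
    """Merge two renders of the same view with a per-pixel depth test.

    scene_* : output of the static neural reconstruction (assumed).
    mesh_*  : output of the mesh-based agent renderer (assumed);
              float("inf") marks pixels that no mesh covers.
    """
    out_rgb: Image = []
    out_depth: DepthMap = []
    for y in range(len(scene_rgb)):
        row_rgb: List[Pixel] = []
        row_depth: List[float] = []
        for x in range(len(scene_rgb[y])):
            # The surface closer to the camera wins; ties keep the scene.
            if mesh_depth[y][x] < scene_depth[y][x]:
                row_rgb.append(mesh_rgb[y][x])
                row_depth.append(mesh_depth[y][x])
            else:
                row_rgb.append(scene_rgb[y][x])
                row_depth.append(scene_depth[y][x])
        out_rgb.append(row_rgb)
        out_depth.append(row_depth)
    return out_rgb, out_depth


if __name__ == "__main__":
    gray, red = (0.5, 0.5, 0.5), (1.0, 0.0, 0.0)
    far = float("inf")
    # One row, three pixels: mesh in front, mesh behind, no mesh.
    rgb, depth = composite_by_depth(
        [[gray, gray, gray]],
        [[10.0, 10.0, 10.0]],
        [[red, red, red]],
        [[4.0, 25.0, far]],
    )
    print(rgb)
    print(depth)
```

In this sketch an inserted vehicle is hidden wherever reconstructed geometry, such as a parked car or a wall, lies closer to the camera, which is the occlusion behavior a depth test provides. A production compositor would typically also handle semi-transparent pixels, shadows, and anti-aliased edges, none of which are modeled here.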
