DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving

Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Yining Shi, Chuang Zhang, Sifa Zheng

TL;DR

DriveCamSim presents Explicit Camera Modeling (ECM) to enable generalizable camera simulation for autonomous driving by establishing explicit pixel-wise correspondences across multi-view and multi-frame interactions in 3D space. It introduces an overlap-based view matching strategy and a random frame sampling scheme to improve context relevance and temporal diversity, along with an information-preserving, potentially identity-aware conditioning mechanism that maintains 3D geometry in both encoding and injection. The framework, built on a pretrained latent diffusion model, achieves state-of-the-art realism, controllability, and temporal consistency on nuScenes, and demonstrates strong generalization across varying camera parameters and frame rates. Ablation studies validate the necessity of ECM, the sampling strategy, and the identity-aware conditioning for foreground/background fidelity and planning relevance, indicating practical impact for configurable AD simulation and downstream tasks.

Abstract

Camera sensor simulation plays a critical role in autonomous driving (AD), e.g., in evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework, DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, decoupling the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates present in the training data. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines and propose an information-preserving control mechanism. This control mechanism not only improves conditional controllability, but can also be extended to be identity-aware to enhance temporal consistency in foreground object rendering. With the above designs, our model demonstrates superior performance in both visual quality and controllability, as well as generalization capability at the spatial level (camera parameter variations) and the temporal level (video frame rate variations), enabling flexible, user-customizable camera simulation tailored to diverse application scenarios. Code will be available at https://github.com/swc-17/DriveCamSim to facilitate future research.
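
To make the idea of an explicit pixel-wise correspondence concrete, the following is a minimal sketch of how a pixel in one camera can be mapped to another camera through a shared 3D point. It is not the paper's ECM implementation, which the abstract does not detail. It assumes a pinhole camera model, a known depth for the source pixel, and 4x4 camera-to-world pose matrices; the function name `pixel_correspondence` and all parameter names are hypothetical.

```python
import numpy as np

def pixel_correspondence(uv, depth, K_src, T_src_to_world, K_tgt, T_tgt_to_world):
    """Map a source-camera pixel to target-camera pixel coordinates.

    Assumptions (not from the paper): pinhole intrinsics K (3x3),
    camera-to-world poses T (4x4), and a known depth along the
    source camera's z-axis. Returns None if the 3D point lies
    behind the target camera.
    """
    # Back-project the pixel to a 3D point in the source camera frame.
    uv1 = np.array([uv[0], uv[1], 1.0])
    p_src = depth * (np.linalg.inv(K_src) @ uv1)

    # Source camera frame -> world frame (homogeneous coordinates).
    p_world = T_src_to_world @ np.append(p_src, 1.0)

    # World frame -> target camera frame.
    p_tgt = np.linalg.inv(T_tgt_to_world) @ p_world
    if p_tgt[2] <= 0:
        return None  # point is behind the target camera

    # Perspective projection with the target intrinsics.
    uvw = K_tgt @ p_tgt[:3]
    return uvw[:2] / uvw[2]
```

Because the mapping depends only on the intrinsics and poses passed in, changing a camera's parameters changes the correspondence directly, with nothing tied to a fixed camera rig. This is the property the abstract attributes to explicit modeling, in contrast to implicit interaction through vanilla attention.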

Paper Structure

This paper contains 21 sections, 3 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Instead of (a) implicit camera modeling in 2D image space, we propose (b) explicit camera modeling in 3D physical world to unleash the (c) spatial-level and (d) temporal-level generalization capabilities for flexible camera simulation.
  • Figure 2: Overall framework of DriveCamSim. The (a) proposed method is built upon a pretrained latent diffusion model, with several (b) attention layers and (c) control layers inserted.
  • Figure 3: Frame sampling strategy for training and inference.
  • Figure 4: Our control mechanism preserves information in the encoding and injection stages and supports identity feature encoding.
  • Figure 5: Qualitative results for spatial-level generalization. When the front camera is rotated 20° to the left, DriveCamSim succeeds in generating images with correct foreground and background, while MagicDrive and DreamForge fail.
  • ...and 12 more figures