Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis
Thang-Anh-Quan Nguyen, Nathan Piasco, Luis Roldão, Moussab Bennehar, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond
TL;DR
PointmapDiff tackles the challenge of extrapolated novel-view synthesis in urban driving by conditioning a pre-trained 2D diffusion model on point maps that encode 3D geometry, and by introducing a reference-guided cross-view attention mechanism. The approach uses a two-stream architecture with Pointmap ControlNets to fuse point-map geometry with diffusion features and a RefAttn module to transfer information from reference views, enabling geometry-consistent view extrapolation and LiDAR-aligned generation. Experiments on KITTI-360 and Waymo demonstrate strong performance and robustness to varying point-map inputs, including sparse LiDAR and dense depth, and show potential for distilling diffusion outputs into 3D representations like 3D Gaussian Splatting. Ablation studies and qualitative analyses reveal the necessity of both the point-map conditioning and cross-view attention, while also highlighting remaining limitations in dynamic scenes and residual blur, pointing to temporal extensions as future work.
Abstract
Synthesizing extrapolated views remains a difficult task, especially in urban driving scenes, where the only reliable sources of data are limited RGB captures and sparse LiDAR points. To address this problem, we present PointmapDiff, a framework for novel view synthesis that utilizes pre-trained 2D diffusion models. Our method leverages point maps (i.e., rasterized 3D scene coordinates) as a conditioning signal, capturing geometric and photometric priors from the reference images to guide the image generation process. With the proposed reference attention layers and ControlNet for point map features, PointmapDiff can generate accurate and consistent results across varying viewpoints while respecting geometric fidelity. Experiments on real-world driving data demonstrate that our method achieves high-quality generation with flexibility over point map conditioning signals (e.g., dense depth maps or even sparse LiDAR points), and that its outputs can be distilled into 3D representations such as 3D Gaussian Splatting to improve view extrapolation.
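To make the notion of a point map concrete, the sketch below shows one plausible way to rasterize 3D points into an image-shaped array of 3D coordinates. It is an illustration under stated assumptions and not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, points already expressed in the target camera frame, and a nearest-point-wins rule when several points fall on the same pixel. The function name `rasterize_pointmap` and its parameters are hypothetical.

```python
import numpy as np


def rasterize_pointmap(points_cam, K, height, width):
    """Rasterize N x 3 points (camera frame) into an H x W x 3 point map.

    Assumptions (not taken from the paper): pinhole projection with
    intrinsics K, and the nearest point (smallest z) wins per pixel.
    Returns the point map and a boolean mask of pixels that received a point.
    """
    pointmap = np.zeros((height, width, 3), dtype=np.float32)
    valid = np.zeros((height, width), dtype=bool)

    K = np.asarray(K, dtype=np.float64)
    pts = np.asarray(points_cam, dtype=np.float64).reshape(-1, 3)
    pts = pts[pts[:, 2] > 1e-6]  # drop points behind the camera

    # Pinhole projection: [u*z, v*z, z]^T = K @ [x, y, z]^T
    uvw = pts @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Keep only points that project inside the image bounds.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    pts, u, v = pts[inside], u[inside], v[inside]

    # Sort near to far, then keep the first (nearest) point per pixel.
    flat = v * width + u
    order = np.argsort(pts[:, 2], kind="stable")
    flat, pts = flat[order], pts[order]
    _, first = np.unique(flat, return_index=True)
    flat, pts = flat[first], pts[first]

    # Write 3D coordinates into the flattened views of the outputs.
    pointmap.reshape(-1, 3)[flat] = pts
    valid.reshape(-1)[flat] = True
    return pointmap, valid
```

With sparse LiDAR input most pixels stay empty, which is why the mask is returned alongside the map; a dense depth map would fill nearly every pixel through the same projection.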
