Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis

Thang-Anh-Quan Nguyen, Nathan Piasco, Luis Roldão, Moussab Bennehar, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond

TL;DR

PointmapDiff tackles the challenge of extrapolated novel-view synthesis in urban driving by conditioning a pre-trained 2D diffusion model on point maps that encode 3D geometry, and by introducing a reference-guided cross-view attention mechanism. The approach uses a two-stream architecture with Pointmap ControlNets to fuse point-map geometry with diffusion features and a RefAttn module to transfer information from reference views, enabling geometry-consistent view extrapolation and LiDAR-aligned generation. Experiments on KITTI-360 and Waymo demonstrate strong performance and robustness to varying point-map inputs, including sparse LiDAR and dense depth, and show potential for distilling diffusion outputs into 3D representations like 3D Gaussian Splatting. Ablation studies and qualitative analyses reveal the necessity of both the point-map conditioning and cross-view attention, while also highlighting remaining limitations in dynamic scenes and residual blur, pointing to temporal extensions as future work.

Abstract

Synthesizing extrapolated views remains a difficult task, especially in urban driving scenes, where the only reliable sources of data are limited RGB captures and sparse LiDAR points. To address this problem, we present PointmapDiff, a framework for novel view synthesis that utilizes pre-trained 2D diffusion models. Our method leverages point maps (i.e., rasterized 3D scene coordinates) as a conditioning signal, capturing geometric and photometric priors from the reference images to guide the image generation process. With the proposed reference attention layers and ControlNet for point map features, PointmapDiff can generate accurate and consistent results across varying viewpoints while respecting geometric fidelity. Experiments on real-life driving data demonstrate that our method achieves high-quality generation with flexibility over point map conditioning signals (e.g., dense depth map or even sparse LiDAR points) and can be used to distill to 3D representations such as 3D Gaussian Splatting for improving view extrapolation.
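
To make the notion of a point map (rasterized 3D scene coordinates) concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the 3D points (for example, LiDAR returns) are already expressed in the coordinate frame of the camera being rendered, and that the camera follows a pinhole model with intrinsics `K`. The function name `rasterize_pointmap`, the nearest-point-per-pixel rule, and the zero fill for empty pixels are illustrative assumptions.

```python
import numpy as np

def rasterize_pointmap(points_cam, K, height, width):
    """Rasterize (N, 3) camera-frame points into an (H, W, 3) point map.

    Hypothetical helper for illustration: each pixel stores the XYZ of the
    nearest point that projects into it; pixels with no point stay zero.
    Returns the point map and a boolean validity mask.
    """
    pointmap = np.zeros((height, width, 3), dtype=np.float32)
    mask = np.zeros((height, width), dtype=bool)

    # Keep only points in front of the camera.
    pts = points_cam[points_cam[:, 2] > 1e-6]
    if pts.shape[0] == 0:
        return pointmap, mask

    # Pinhole projection: [u, v, 1]^T ~ K @ [x, y, z]^T.
    uvw = pts @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)

    # Discard points that fall outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    pts, u, v = pts[inside], u[inside], v[inside]
    if pts.shape[0] == 0:
        return pointmap, mask

    # Z-buffer: sort by increasing depth, then keep the first (nearest)
    # point for every pixel via np.unique on the flattened pixel index.
    order = np.argsort(pts[:, 2], kind="stable")
    pts, u, v = pts[order], u[order], v[order]
    flat = v * width + u
    _, first = np.unique(flat, return_index=True)

    pointmap[v[first], u[first]] = pts[first]
    mask[v[first], u[first]] = True
    return pointmap, mask

# Usage with made-up intrinsics and a handful of points.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 5.0],      # same pixel as the next point, farther
                   [0.0, 0.0, 2.0],      # nearest point at the image center
                   [0.0, 0.0, -1.0],     # behind the camera, dropped
                   [0.125, 0.0625, 1.0]])
pointmap, mask = rasterize_pointmap(points, K, height=48, width=64)
```

With sparse LiDAR input most pixels remain empty, while a dense depth map fills nearly every pixel; both produce the same (H, W, 3) layout, which is consistent with the flexibility over conditioning signals described in the abstract.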

Paper Structure (26 sections, 13 equations, 18 figures, 6 tables)

This paper contains 26 sections, 13 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: PointmapDiff performs extrapolated view synthesis in urban scenes. We present viewpoints generated at a $45^\circ$ rotation to the right (first row) and at a $1.5\,\mathrm{m}$ shift to the left (second row). Our approach significantly outperforms the baselines when rendering viewpoints beyond the original recorded trajectory, whereas 3DGS (Kerbl et al., 2023) struggles with severe artifacts.
  • Figure 2: Given a reference image and a synchronized LiDAR scan, the image observes only a small part (blue) of the scene, yet the geometric information from the rest of the LiDAR scan (orange) can still be used to generate meaningful content. Cars that appear in both the LiDAR scan and the generated image are labeled in red, illustrating the advantage of our method over the baselines.
  • Figure 3: Method overview. (left) PointmapDiff is trained in the latent space of a fixed VAE with encoder $\mathcal{E}$ and decoder $\mathcal{D}$. Given a reference RGB image $I^{r}$ and the corresponding geometry $D^{r}$, we obtain a pair of point maps $\{X^{r,t}, X^{t,t}\}$ as inputs. We predict the target image $I^{t}$ from the geometry signal of the target point map and the information provided by the reference U-Net. In particular, two Pointmap ControlNets extract geometric feature correspondences, which are concatenated ⓒ with the intermediate SD feature maps. We freeze the original SD model and train only the Pointmap ControlNets and the reference attention module. (right) We extract reference features using our reference U-Net. These augmented features are integrated into the target U-Net through a reference-guided cross-view attention mechanism, whose output is added $\bigoplus$ throughout the target U-Net.
  • Figure 4: Qualitative comparison on KITTI-360 (Liao et al., 2022). We demonstrate three scenarios: rotating, shifting, and flying upward. The test view represents the conventional camera sampled from forward-facing trajectories. We also include training images that provide the best available coverage as a reference.
  • Figure 5: Qualitative comparison on Waymo (Sun et al., 2020) with different shifting distances.
  • ...and 13 more figures
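
The reference-guided cross-view attention described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is a generic single-head cross-attention with a residual addition, written under stated assumptions rather than taken from the paper: the function name `reference_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the flattening of feature maps into token rows are all hypothetical. Queries come from the target features, keys and values come from the reference features, and the attended result is added back to the target features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_cross_attention(target_feats, ref_feats, w_q, w_k, w_v):
    """Single-head cross-view attention sketch (illustrative, not the paper's code).

    target_feats: (N_t, C) flattened target feature tokens (queries).
    ref_feats:    (N_r, C) flattened reference feature tokens (keys/values).
    w_q, w_k:     (C, d) hypothetical query/key projections.
    w_v:          (C, C) hypothetical value projection.
    Returns the updated target features (N_t, C) and the attention map (N_t, N_r).
    """
    q = target_feats @ w_q
    k = ref_feats @ w_k
    v = ref_feats @ w_v

    # Scaled dot-product attention from each target token to all reference tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)

    # Residual addition: reference information is added to the target stream.
    return target_feats + attn @ v, attn

# Usage with random features standing in for U-Net activations.
rng = np.random.default_rng(0)
C, d = 8, 4
target = rng.normal(size=(6, C))
reference = rng.normal(size=(10, C))
w_q = rng.normal(size=(C, d))
w_k = rng.normal(size=(C, d))
w_v = rng.normal(size=(C, C))
updated, attn = reference_cross_attention(target, reference, w_q, w_k, w_v)
```

Because the attended values enter through a residual sum, a value projection initialized to zero leaves the target features unchanged, which is one common way to add new attention layers to a frozen backbone without disturbing its initial behavior; whether the paper uses such an initialization is not stated in this section.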