A Single Image and Multimodality Is All You Need for Novel View Synthesis

Amirhosein Javadi, Chi-Shiang Gau, Konstantinos D. Polyzos, Tara Javidi

TL;DR

This work introduces a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis.

Abstract

Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.
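
The localized Gaussian Process idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the squared-exponential kernel, the fixed angular radius, the constant prior mean, and every hyperparameter value below are assumptions chosen only for illustration, and the function names are hypothetical. The sketch predicts a depth mean and variance at one (azimuth, elevation) query using only the range measurements that fall inside a local angular neighborhood.

```python
import numpy as np


def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel between two sets of (azimuth, elevation) points."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dist / length_scale ** 2)


def local_gp_depth(query, angles, ranges, radius=0.2, length_scale=0.1,
                   signal_var=25.0, noise_var=0.01, prior_mean=None):
    """Predict depth mean and variance at one angular query (radians).

    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    if prior_mean is None:
        prior_mean = float(np.mean(ranges))

    # Keep only the measurements inside the local angular neighborhood.
    mask = np.linalg.norm(angles - query, axis=1) <= radius
    if not np.any(mask):
        # No nearby observations: fall back to the prior with full uncertainty.
        return prior_mean, signal_var

    x, y = angles[mask], ranges[mask]
    k_xx = rbf_kernel(x, x, length_scale, signal_var) + noise_var * np.eye(len(x))
    k_qx = rbf_kernel(query[None, :], x, length_scale, signal_var)

    # Standard GP regression equations, solved through a Cholesky factor.
    chol = np.linalg.cholesky(k_xx)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y - prior_mean))
    mean = prior_mean + (k_qx @ alpha).item()
    v = np.linalg.solve(chol, k_qx.T)
    var = signal_var - (v.T @ v).item()
    return mean, max(var, 0.0)


# Toy example: three nearby returns around 10 m and one distant return.
angles = np.array([[0.0, 0.0], [0.05, 0.0], [0.0, 0.05], [1.0, 1.0]])
ranges = np.array([10.0, 10.5, 9.5, 40.0])
near_mean, near_var = local_gp_depth(np.array([0.0, 0.0]), angles, ranges)
far_mean, far_var = local_gp_depth(np.array([3.0, 3.0]), angles, ranges)
```

Because each query only involves its local neighbors, the linear system stays small regardless of the total number of measurements, and the returned variance grows toward the prior variance where observations are scarce.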

Paper Structure

This paper contains 9 sections, 14 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of the proposed multimodal single-image novel-view synthesis pipeline. Sparse range sensing measurements are first processed by the proposed GP-based depth reconstruction module to produce a dense depth map. The reconstructed depth and the input RGB image are used to form a colored 3D point cloud, which is rendered along a target camera trajectory to generate sparse novel-view conditioning frames. These rendered frames are provided as geometric conditioning to a diffusion model, which synthesizes a temporally consistent video at the target viewpoints.
  • Figure 2: Illustration of the proposed localized Gaussian Process depth reconstruction in the angular domain. Left: Sparse radar range measurements represented in azimuth–elevation space, with the current target viewing direction indicated. Right: Zoomed-in view around the target location, highlighting the local neighborhood used for Gaussian Process inference. Only range measurements within this localized region contribute to the depth estimation at the target direction, enabling efficient and spatially adaptive depth reconstruction from sparse observations.
  • Figure 3: Qualitative comparison of single-image novel-view synthesis on the View-of-Delft dataset. From left to right, we show: the input image; the novel view generated by GEN3C [ren2025gen3c] using its default monocular depth estimator, MoGe [wang2025moge]; the novel view generated by GEN3C when monocular depth is replaced with our sparse range-sensor depth reconstruction module; and the ground-truth target view. For each generated view, we report LPIPS with respect to the ground truth (lower is better). Across all examples, our depth reconstruction yields consistently lower LPIPS and improved geometric alignment, underscoring the importance of reliable geometry for diffusion-based rendering from single-view inputs.
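
The point-cloud step described in the Figure 1 caption (dense depth plus RGB lifted to a colored 3D point cloud, then rendered at a target camera pose) can be sketched as follows. This is a rough illustration under stated assumptions, not the pipeline's actual renderer: it assumes an ideal pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and it uses nearest-pixel splatting with no hole filling. The function names are also hypothetical.

```python
import numpy as np


def unproject_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Lift a dense depth map and an RGB image to a colored 3D point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    # Drop pixels with missing or non-positive depth.
    valid = np.isfinite(points).all(axis=1) & (points[:, 2] > 0)
    return points[valid], colors[valid]


def render_points(points, colors, rotation, translation,
                  fx, fy, cx, cy, height, width):
    """Splat points into a target camera; returns the image and a coverage mask.

    rotation (3x3) and translation (3,) map source-camera coordinates
    to target-camera coordinates.
    """
    cam = points @ rotation.T + translation
    in_front = cam[:, 2] > 1e-6
    cam, col = cam[in_front], colors[in_front]

    u = np.round(fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = np.round(fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, col = u[inside], v[inside], cam[inside, 2], col[inside]

    # Write far points first so that nearer points overwrite them.
    order = np.argsort(-z)
    image = np.zeros((height, width, 3), dtype=colors.dtype)
    mask = np.zeros((height, width), dtype=bool)
    image[v[order], u[order]] = col[order]
    mask[v[order], u[order]] = True
    return image, mask


# Round trip with an identity pose on a tiny synthetic frame.
rgb = np.arange(4 * 6 * 3).reshape(4, 6, 3).astype(np.uint8)
depth = np.full((4, 6), 5.0)
pts, cols = unproject_to_point_cloud(depth, rgb, 10.0, 10.0, 2.5, 1.5)
image, mask = render_points(pts, cols, np.eye(3), np.zeros(3),
                            10.0, 10.0, 2.5, 1.5, 4, 6)
```

With a non-identity pose, pixels that no point lands on remain empty in `mask`; those gaps are why the rendered frames are only sparse conditioning and the diffusion model is needed to complete the view.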