CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model

Yuqi Zhang, Guanying Chen, Jiaxing Chen, Chuanyu Fu, Chuan Huang, Shuguang Cui

TL;DR

This paper presents CloseUpShot, a diffusion-based framework for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. It proposes hierarchical warping and occlusion-aware noise suppression to improve the quality and completeness of the conditioning images given to the video diffusion model.

Abstract

Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities, making them a promising tool for enhancing reconstruction quality under sparse-view settings. However, existing approaches are primarily designed for modest viewpoint variations and struggle to capture fine-grained details in close-up scenarios, where input information is severely limited. In this paper, we present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. Specifically, we observe that pixel-warping conditioning suffers from severe sparsity and background leakage in close-up settings. To address this, we propose hierarchical warping and occlusion-aware noise suppression, enhancing the quality and completeness of the conditioning images for the video diffusion model. Furthermore, to compensate for the lack of globally consistent 3D constraints in sparse conditioning inputs, we introduce global structure guidance, which leverages a dense fused point cloud to provide consistent geometric context to the diffusion process. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, especially in close-up novel view synthesis, validating the effectiveness of our design.
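
To make the sparsity problem concrete, here is a minimal NumPy sketch of point-based conditioning. It is an illustration, not the authors' implementation. It assumes a pinhole camera, nearest-pixel splatting with a depth test, and a toy planar point cloud; the function name project_points and all numeric values are hypothetical. The same points are projected into a regular view and a close-up view, and the fraction of covered pixels is compared.

```python
import numpy as np

def project_points(points, colors, K, R, t, height, width):
    """Splat world-space points into a pinhole camera with a depth test.

    points: (N, 3) world coordinates; colors: (N, 3) values in [0, 1].
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    Returns the rendered image (H, W, 3) and a boolean coverage mask (H, W).
    """
    cam = points @ R.T + t                      # world -> camera coordinates
    z = cam[:, 2]
    front = z > 1e-6                            # drop points behind the camera
    cam, z, colors = cam[front], z[front], colors[front]

    uvw = cam @ K.T                             # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Depth test: keep, per pixel, only the points at the smallest depth.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    nearest = z <= depth[v, u]

    image = np.zeros((height, width, 3))
    image[v[nearest], u[nearest]] = colors[nearest]
    return image, np.isfinite(depth)

# Toy scene (assumed): a 128 x 128 grid of points on the plane z = 0.
side = np.linspace(-1.0, 1.0, 128)
gx, gy = np.meshgrid(side, side)
points = np.stack([gx.ravel(), gy.ravel(), np.zeros(gx.size)], axis=1)
colors = np.full((points.shape[0], 3), 0.5)

K = np.array([[64.0, 0.0, 32.0],
              [0.0, 64.0, 32.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)

# Regular view: the camera sits 2 units from the plane.
_, far_mask = project_points(points, colors, K, R, np.array([0.0, 0.0, 2.0]), 64, 64)
# Close-up view: the camera moves forward to 0.5 units from the plane.
_, close_mask = project_points(points, colors, K, R, np.array([0.0, 0.0, 0.5]), 64, 64)

far_cov, close_cov = far_mask.mean(), close_mask.mean()
print(f"regular view coverage: {far_cov:.2f}, close-up coverage: {close_cov:.2f}")
```

In this toy setup, moving the camera forward spreads the same points over more pixels than there are points to fill them, so the close-up conditioning image is left with holes. This is the sparsity that the abstract attributes to pixel-warping conditioning.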

Paper Structure

This paper contains 19 sections, 10 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Given sparse-view inputs, we propose CloseUpShot, a novel-view synthesis framework that leverages a diffusion prior to generate high-fidelity close-up images and support detail-preserving 3D reconstruction, especially when users move forward or zoom in (e.g., the original green camera moves forward to the close-up blue camera in the left column) for fine-grained inspection.
  • Figure 2: Limitations of point-conditioned diffusion models. (a) Given sparse input views, we extract a point cloud, which is projected into a novel view to serve as conditioning (b) for the diffusion model (c). When the target view is similar to the input views (e.g., the regular view), the projection is dense and offers effective guidance. However, for close-up views that require zooming in or moving closer, the projected conditioning becomes sparse and incomplete. These weak conditioning signals fail to guide the diffusion model effectively, leading to low-fidelity and artifact-prone outputs.
  • Figure 3: Overview. Our pipeline takes two sparse input views and is capable of synthesizing fine-grained novel views under close-up settings using a point-conditioned video diffusion model. First, a pretrained estimator is applied to obtain depth maps and camera parameters from the input images. Second, we introduce two effective modules, hierarchical warping and occlusion-aware noise suppression, to enhance the sparse and noisy conditioning images, especially in the close-up setting. Third, we perform a multi-view consistency check to construct a global point cloud, which is projected into target views to provide global structure guidance for the denoising U-Net. Finally, the generated novel views, together with the reference inputs, are used to supervise 3DGS for photorealistic and detail-preserving 3D reconstruction.
  • Figure 4: Hierarchical Warping for Diffusion Conditioning. We perform forward warping at both high and low resolutions to obtain a sharp but sparse high-resolution image and a blurry but dense low-resolution image. The low-resolution result is then upsampled to fill the missing regions in the high-resolution image, producing a dense conditioning input for the diffusion model. Note that we only illustrate the reliable regions for simplicity.
  • Figure 5: Problem of background leakage. In close-up views, background points often leak through gaps in the sparse foreground, leading to incorrect projections in the conditioning images. Our noise suppression strategy mitigates this issue by filtering out these artifacts, resulting in cleaner and more reliable conditioning images for diffusion generation.
  • ...and 6 more figures
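
The hierarchical warping idea in the Figure 4 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: values are single-channel, upsampling is nearest-neighbour, image sizes are divisible by the scale factor, and the names splat and hierarchical_warp plus the factor of 4 are hypothetical. Samples are forward-warped at full resolution and at a reduced resolution, and the upsampled low-resolution result fills only the pixels the full-resolution warp left empty.

```python
import numpy as np

def splat(u, v, z, values, height, width):
    """Forward-splat samples at pixel coordinates (u, v) with a depth test."""
    ui, vi = np.round(u).astype(int), np.round(v).astype(int)
    ok = (ui >= 0) & (ui < width) & (vi >= 0) & (vi < height)
    ui, vi, z, values = ui[ok], vi[ok], z[ok], values[ok]

    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (vi, ui), z)           # smallest depth per pixel
    win = z <= depth[vi, ui]                    # samples that pass the depth test

    out = np.zeros((height, width), dtype=float)
    out[vi[win], ui[win]] = values[win]
    return out, np.isfinite(depth)

def hierarchical_warp(u, v, z, values, height, width, factor=4):
    """Fill holes of a high-resolution warp with an upsampled low-resolution warp."""
    # Sharp but sparse: splat at the target resolution.
    hi, hi_mask = splat(u, v, z, values, height, width)
    # Blurry but dense: splat the same samples onto a coarser grid.
    lo, lo_mask = splat(u / factor, v / factor, z, values,
                        height // factor, width // factor)
    # Nearest-neighbour upsampling back to the target resolution.
    lo_up = np.repeat(np.repeat(lo, factor, axis=0), factor, axis=1)
    lo_mask_up = np.repeat(np.repeat(lo_mask, factor, axis=0), factor, axis=1)
    # Keep high-resolution pixels where they exist; fall back to low resolution.
    fused = np.where(hi_mask, hi, lo_up)
    return fused, hi_mask, hi_mask | lo_mask_up

# Toy input (assumed): warped samples land on every other pixel of a 64 x 64 image.
ys, xs = np.mgrid[0:64:2, 0:64:2]
u, v = xs.ravel().astype(float), ys.ravel().astype(float)
z = np.ones_like(u)
values = (u + v) / 128.0 + 0.1

fused, hi_mask, fused_mask = hierarchical_warp(u, v, z, values, 64, 64)
print(f"high-res coverage: {hi_mask.mean():.2f}, fused coverage: {fused_mask.mean():.2f}")
```

The fused image keeps every pixel of the high-resolution warp unchanged and uses the coarse warp only as a fallback, which matches the caption's description of filling missing regions rather than blending the two results.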