VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan

TL;DR

VideoScene tackles the problem of generating 3D-consistent scenes from sparse views by distilling a pretrained video diffusion model into a one-step generator. It introduces a 3D-aware leap flow distillation that leverages a fast coarse 3DGS prior to seed diffusion-based generation, and a dynamic denoising policy network governed by a contextual-bandit formulation to adaptively select the denoising timestep. The method achieves faster inference and improved 3D structure fidelity compared to prior video diffusion approaches, with strong generalization across datasets and promising applicability to downstream 3D reconstruction pipelines. Overall, VideoScene offers a practical, efficient bridge from video priors to actionable 3D scene content, enabling real-time or near-real-time video-to-3D applications.

Abstract

Recovering 3D scenes from sparse views is challenging because the problem is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, their performance still degrades when the input views overlap minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they can generate video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering studies have started to explore video generative priors for creating 3D scenes from sparse views. Despite impressive improvements, these methods are limited by slow inference and a lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene, which distills the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool that bridges the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information, and we train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that VideoScene achieves faster and superior 3D scene generation results compared with previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: https://hanyang-21.github.io/VideoScene
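
To make the idea of adaptively choosing a leap timestep more concrete, here is a minimal sketch of a contextual bandit that picks one of a few candidate timesteps from a scalar context. It is not the paper's policy network: a tabular epsilon-greedy learner is swapped in for the neural policy, and the class `TimestepBandit`, the candidate timesteps, the "coarse-render quality" score, and `toy_reward` are all hypothetical names and values chosen for illustration.

```python
import random


class TimestepBandit:
    """Tabular epsilon-greedy contextual bandit over candidate timesteps.

    Illustrative stand-in only; the context is a single score in [0, 1].
    """

    def __init__(self, timesteps, n_buckets=3, epsilon=0.1, seed=0):
        self.timesteps = list(timesteps)
        self.n_buckets = n_buckets
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        n_arms = len(self.timesteps)
        self.counts = [[0] * n_arms for _ in range(n_buckets)]
        self.means = [[0.0] * n_arms for _ in range(n_buckets)]

    def bucket(self, score):
        # Map a context score in [0, 1] to a discrete bucket index.
        return min(int(score * self.n_buckets), self.n_buckets - 1)

    def select(self, score, explore=True):
        b = self.bucket(score)
        arms = range(len(self.timesteps))
        if explore:
            # Try every arm once per bucket before trusting the estimates.
            untried = [a for a in arms if self.counts[b][a] == 0]
            if untried:
                return untried[0]
            if self.rng.random() < self.epsilon:
                return self.rng.randrange(len(self.timesteps))
        # Greedy choice: arm with the highest mean reward in this bucket.
        return max(arms, key=lambda a: self.means[b][a])

    def update(self, score, arm, reward):
        # Incremental running mean of the reward for (bucket, arm).
        b = self.bucket(score)
        self.counts[b][arm] += 1
        self.means[b][arm] += (reward - self.means[b][arm]) / self.counts[b][arm]


def toy_reward(score, timestep):
    # Hypothetical reward: a low-quality coarse render (score near 0) is
    # assumed to benefit from more noise, i.e. a larger timestep.
    target = 800 - 600 * score
    return -abs(timestep - target) / 1000.0


def train_toy_policy(rounds=300, seed=0):
    rng = random.Random(seed)
    bandit = TimestepBandit(timesteps=[200, 500, 800], seed=seed)
    for _ in range(rounds):
        score = rng.choice([0.0, 0.5, 1.0])  # toy context
        arm = bandit.select(score)
        bandit.update(score, arm, toy_reward(score, bandit.timesteps[arm]))
    return bandit
```

After training, calling `bandit.select(score, explore=False)` returns the index of the timestep with the best estimated reward for that context. In a real system the reward would presumably come from the quality of the generated video rather than from a hand-written function.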

Paper Structure

This paper contains 27 sections, 13 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: VideoScene enables one-step video generation of 3D scenes with strong structural consistency from just two input images. The top row shows the input sparse views and the following two rows show the output novel-view video frames.
  • Figure 2: Pipeline of VideoScene. Given a pair of input views, we first generate a coarse 3D representation with a fast feed-forward 3DGS model (i.e., MVSplat [chen2025mvsplat]), which enables rendering along an accurately controlled camera trajectory. The encoded rendering latent ("input") and the encoded input-pair latent ("condition") are combined as input to the consistency model. A forward diffusion step then adds noise to the video. The noised $\mathbf{x}_{n+1}^r$ is sent to both the student and the teacher model, which predict the videos $\mathbf{x}_0^{pred}$ at timestep $t_{n+1}$ and $\hat{\mathbf{x}}_0^{\phi}$ at timestep $t_n$, respectively. Finally, the student model and DDPNet are updated independently through the distillation loss and the DDP loss.
  • Figure 3: Qualitative comparison. We can observe that baseline models suffer from issues such as blurriness, frame skipping, excessive motion, and shifts in the relative positioning of objects, while our VideoScene achieves higher output quality and improved 3D coherence.
  • Figure 4: Qualitative results in cross-dataset generalization. Models trained on the source dataset RealEstate10K are tested on the ACID dataset. Fine-tuned models improve in 3D consistency but still fail in the one-step setting.
  • Figure 5: Matching results comparison. Green represents high-quality matching results, while red represents discarded matching results. More green high-quality matches indicate a higher level of geometric consistency between the two views.
  • ...and 10 more figures
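
The forward-diffusion step mentioned in the Figure 2 caption can be illustrated with a small sketch. It assumes the standard DDPM-style noising rule $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with a linear beta schedule; the schedule, the leap timestep of 400, the plain-list "latent", and all function names are assumptions made for illustration, and the paper's actual formulation may differ. The student and teacher networks are not modelled: the noise prediction is passed in directly, and mean squared error stands in for a distillation-style discrepancy between two clean-latent predictions.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, noise, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * e for x, e in zip(x0, noise)]


def predict_x0(x_t, noise_pred, alpha_bar):
    """Invert the noising rule given a (predicted) noise vector."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - s * e) / a for x, e in zip(x_t, noise_pred)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def demo(leap_t=400, dim=8, seed=0):
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars()
    # Stand-in for the encoded coarse render latent (hypothetical values).
    coarse_latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Noise the coarse latent to an intermediate timestep instead of
    # starting from pure noise.
    x_t = add_noise(coarse_latent, noise, alpha_bars[leap_t])
    # With the exact noise, the clean latent is recovered in one step.
    recovered = predict_x0(x_t, noise, alpha_bars[leap_t])
    return coarse_latent, recovered
```

Calling `demo()` returns the clean latent and its one-step reconstruction; with a learned noise predictor in place of the exact noise, `mse` between the student's and the teacher's reconstructions would play the role of a distillation-style loss in this toy setting.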