Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion

Jangho Park, Taesung Kwon, Jong Chul Ye

TL;DR

This work tackles the challenge of generating synchronized multi-view 4D video from a single video without training. It introduces a two-stage pipeline that first creates boundary key frames using depth-guided novel-view synthesis, then fills the interior via spatio-temporal bidirectional interpolation conditioned on warped views, all without training a 4D diffusion model. The approach demonstrates competitive performance on fixed novel-view and bullet-time tasks, showing strong spatio-temporal coherence and robustness under varied conditions. By leveraging depth priors and off-the-shelf diffusion tools, Zero4D offers a practical, scalable solution for 4D video generation when large multi-view datasets and compute are unavailable.

Abstract

Multi-view or 4D video generation has emerged as a significant research topic. Nonetheless, recent approaches to 4D generation still struggle with fundamental limitations: they rely primarily either on combining multiple video diffusion models with additional training, or on compute-intensive training of a full 4D diffusion model despite limited real-world 4D data. To address these challenges, we propose the first training-free 4D video generation method, which leverages off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) By designating the edge frames in the spatio-temporal sampling grid as key frames, we first synthesize them using a video diffusion model, with a depth-based warping technique for guidance. This ensures structural consistency across the generated frames, preserving spatial and temporal coherence. (2) We then interpolate the remaining frames using a video diffusion model, constructing a fully populated sampling grid while preserving spatial and temporal consistency. In this way, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and relies entirely on an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
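To make the sampling grid in step (1) concrete, the sketch below marks which cells of a views × frames grid lie on its edges (the key frames) and which are interior cells left for the interpolation in step (2). The layout (rows as camera views, columns as time steps, input video on row 0) and the name `boundary_mask` are illustrative assumptions, not the authors' code; the diffusion model itself is not represented here.

```python
import numpy as np

def boundary_mask(num_views: int, num_frames: int) -> np.ndarray:
    """Return a boolean (num_views, num_frames) grid; True marks edge cells.

    Assumed layout (hypothetical, for illustration only):
      row 0        -> the input video (first camera, all time steps)
      column 0     -> novel views at the first time step
      last row     -> video at the final camera view
      last column  -> novel views at the final time step
    """
    mask = np.zeros((num_views, num_frames), dtype=bool)
    mask[0, :] = True    # input video row
    mask[-1, :] = True   # end-view video row
    mask[:, 0] = True    # first-frame column
    mask[:, -1] = True   # end-frame column
    return mask

# Example: 5 camera views and 6 time steps.
mask = boundary_mask(5, 6)
key_cells = np.argwhere(mask)        # (view, frame) pairs on the grid edge
interior_cells = np.argwhere(~mask)  # cells to be filled by interpolation
print(len(key_cells), len(interior_cells))
```

In this toy layout, every interior cell sits between two edge cells along both the camera axis and the time axis, which is what allows interpolation to be conditioned on boundary frames in either direction.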

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 6 tables, 3 algorithms.

Figures (6)

  • Figure 1: Qualitative comparison. We compare our method with baseline models in terms of novel-view video generation and global spatio-temporal consistency. Given a single input video, both baselines and ours generate outputs across multiple views and time steps. To evaluate global consistency, we leverage baselines to produce bullet-time videos at all input frames and re-align them to a fixed viewpoint. We also visualize x–t slices (red lines) to highlight temporal coherence. While baselines exhibit inconsistencies across views and time, our method preserves spatio-temporal coherence and yields high-quality multi-view videos.
  • Figure 2: Generation pipeline of Zero4D. (a) Key frame generation step: Starting from the input video (shown as the gray-shaded row), we sequentially generate boundary frames (novel view synthesis, end-view video generation, and end-frame view synthesis), where each step leverages the results of the previous one. (b) Spatio-temporal bidirectional interpolation step: Starting from the noisy frames, we alternately perform camera-axis and time-axis interpolation, each conditioned on boundary frames, to progressively denoise the 4D grid. Through this bidirectional process, noisy latents are refined into globally coherent spatio-temporal videos. The detailed algorithm is described in Algorithm \ref{alg:zero4d_pipeline}.
  • Figure 3: Result from Zero4D. Our model generates high-quality 4D videos from a single input video, enabling diverse camera motions such as orbit, transition, and dolly movements. As illustrated, the synthesized videos maintain spatial and temporal consistency across multiple views and frames, effectively rendering novel perspectives that are not present in the original input.
  • Figure 4: Camera trajectory visualization. With a monocular depth estimation model, our approach can generate spatio-temporally synchronized videos along various novel camera trajectories.
  • Figure 5: Input Video Warping. Given a single video, we utilize an off-the-shelf depth estimation model to generate warped frames from novel viewpoints.
  • ...and 1 more figures
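
The depth-based warping named in the Figure 5 caption can be illustrated with a generic forward warp: unproject each pixel with its depth, move it into the target camera, and reproject. This is a minimal sketch under assumptions the source does not state: a pinhole camera with intrinsics `K`, a known relative pose `(R, t)` from source to target camera, and a simple z-buffer for collisions. The function name `warp_to_novel_view` is hypothetical, and the depth map would in practice come from an off-the-shelf monocular depth estimator.

```python
import numpy as np

def warp_to_novel_view(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) into a target camera.

    depth: (H, W) positive depth per source pixel.
    K:     (3, 3) pinhole intrinsics (assumed shared by both cameras).
    R, t:  source-to-target pose, X_target = R @ X_source + t.
    Returns (warped, mask); mask is False where no source pixel landed.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject pixels to 3D points in the source camera frame.
    pts_src = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    # Move points into the target camera frame and project them.
    pts_tgt = R @ pts_src + np.asarray(t, dtype=np.float64).reshape(3, 1)
    proj = K @ pts_tgt
    z = pts_tgt[2]
    safe_z = np.where(z > 1e-6, z, 1.0)  # avoid division by zero
    u_t = np.round(proj[0] / safe_z).astype(int)
    v_t = np.round(proj[1] / safe_z).astype(int)

    # Keep points in front of the camera that land inside the image.
    ok = (z > 1e-6) & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)
    idx = np.flatnonzero(ok)

    # Z-buffer: when several points hit one pixel, keep the nearest.
    zbuf = np.full((H, W), np.inf)
    np.minimum.at(zbuf, (v_t[idx], u_t[idx]), z[idx])
    idx = idx[z[idx] == zbuf[v_t[idx], u_t[idx]]]

    colors = image.reshape(-1, image.shape[-1])
    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    warped[v_t[idx], u_t[idx]] = colors[idx]
    mask[v_t[idx], u_t[idx]] = True
    return warped, mask

# Toy usage: constant depth and a small sideways camera translation.
K = np.array([[4.0, 0.0, 2.0], [0.0, 4.0, 2.0], [0.0, 0.0, 1.0]])
image = np.arange(5 * 5 * 3, dtype=np.float64).reshape(5, 5, 3)
depth = np.full((5, 5), 2.0)
warped, mask = warp_to_novel_view(image, depth, K, np.eye(3), np.array([0.5, 0.0, 0.0]))
```

Pixels where `mask` is False are holes with no source observation (disocclusions or out-of-frame content); in a pipeline like the one described, those are presumably the regions the video diffusion model has to synthesize, with the warped pixels serving as guidance.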