Generative View Stitching

Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann

TL;DR

This paper addresses the limitation of autoregressive video diffusion models in conditioning on future frames for camera-guided generation, which often leads to collisions and collapse. It introduces Generative View Stitching (GVS), a training-free diffusion stitching method that samples an entire sequence in parallel and is compatible with any Diffusion Forcing video model, enabling faithful adherence to predefined camera trajectories. To achieve temporal and long-range coherence, the authors propose Omni Guidance to strengthen past/future conditioning and a loop-closing mechanism via cyclic conditioning that enforces consistency across the full sequence. Empirical results show GVS yields stable, collision-free, and loop-closing video generations across diverse trajectories, including the Impossible Staircase, with strong improvements over autoregressive baselines and diffusion-stitching counterparts while maintaining competitive visual quality.

Abstract

Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
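
To make the idea of sampling the whole sequence in parallel more concrete, the following minimal sketch partitions a target video into non-overlapping chunks and pairs every target chunk with its past and future neighbors. It is an illustration under stated assumptions, not the authors' implementation: the function names and the chunk length of 2 frames are hypothetical, chosen only so that a past chunk, a target chunk, and a future chunk together fit inside an 8-frame context window.

```python
from typing import List, Optional, Tuple

# A window is (past chunk, target chunk, future chunk); the past or future
# chunk is None at the two ends of the video.
Window = Tuple[Optional[range], range, Optional[range]]


def partition_chunks(num_frames: int, chunk_len: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into non-overlapping chunks."""
    if num_frames <= 0 or chunk_len <= 0:
        raise ValueError("num_frames and chunk_len must be positive")
    return [
        range(start, min(start + chunk_len, num_frames))
        for start in range(0, num_frames, chunk_len)
    ]


def temporal_windows(chunks: List[range]) -> List[Window]:
    """Pair each target chunk with its temporally neighboring chunks."""
    windows: List[Window] = []
    for i, target in enumerate(chunks):
        past = chunks[i - 1] if i > 0 else None
        future = chunks[i + 1] if i + 1 < len(chunks) else None
        windows.append((past, target, future))
    return windows


def window_length(window: Window) -> int:
    """Number of frames a window occupies in the model's context."""
    return sum(len(chunk) for chunk in window if chunk is not None)


if __name__ == "__main__":
    # Hypothetical setting: 120 target frames, 2-frame chunks.
    chunks = partition_chunks(num_frames=120, chunk_len=2)
    windows = temporal_windows(chunks)
    print(len(windows), max(window_length(w) for w in windows))
```

Because every window is defined up front, all target chunks can be denoised at the same diffusion step rather than one after another, which is what lets later parts of the camera trajectory influence earlier frames.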

Paper Structure

This paper contains 38 sections, 6 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: Generative View Stitching (GVS) enables stable camera-guided generation of long videos. Given a pretrained DFoT video model \cite{song2025historyguidedvideodiffusion} with an 8-frame context window and a predefined camera trajectory, GVS can generate a 120-frame navigation video that is stable, collision-free, faithful to the conditioning trajectory, consistent, and closes loops. Autoregressive sampling, by contrast, diverges due to collisions with the generated scene, is not faithful to the conditioning trajectory, and demonstrates poor loop closure even when augmented with RAG.
  • Figure 2: Generative View Stitching (GVS) is a training-free diffusion stitching method that is compatible with any off-the-shelf video model trained with Diffusion Forcing (DF). We first partition the target video into non-overlapping chunks shorter than the model's context window, then denoise every target chunk jointly with its neighboring chunks to condition on both the past and future. We use the denoised target chunk of every context window to update the noisy stitched sequence while discarding the denoised past and future conditioning chunks. We further enable Omni Guidance (Sec. \ref{sec:omniguidance}), which enhances temporal consistency, by replacing the original score function $\epsilon_{\theta}$ with the guided score function $\tilde{\epsilon}_{\theta}$ in Eq. \ref{eq:guided_score}.
  • Figure 3: Effect of Omni Guidance and Stochasticity. Without Omni Guidance and with zero stochasticity ($\eta = 0$), the generations lack temporal consistency and instead exhibit hazy transitions between different scenes. Increasing stochasticity to its maximum ($\eta = 1.0$) enhances consistency but leads to oversmoothing. Our full method with Omni Guidance and partial stochasticity ($\eta = 0.9$) enables consistent generation without oversmoothing.
  • Figure 4: GVS Requires Explicit Loop Closing. Despite its global theoretical receptive field, our stitching method requires an explicit loop closing mechanism to "visually return to the same place". Note that the camera centers are offset from the panorama's rotation center purely for visual clarity.
  • Figure 5: Loop Closing via Cyclic Conditioning. GVS closes loops via cyclic conditioning, whereby target chunks are denoised by two alternating sets of context windows: temporal windows, which condition target chunks on their temporally neighboring chunks, and spatial windows, which condition target chunks on temporally distant but spatially close neighboring chunks. As a result, target chunks are conditioned on all relevant neighbors across the entire stitching process. See Fig. \ref{fig:cyclic_cond_all} for the full set of spatial windows.
  • ...and 11 more figures
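
The update rule in the Figure 2 caption and the alternation in the Figure 5 caption can be sketched together in a few lines. This is a toy illustration under loud assumptions, not the paper's algorithm: frames are plain floats, `mean_denoise` is a made-up stand-in for a real video diffusion model, the windows are hand-written, and switching between temporal and spatial windows on every other step is only one possible reading of "alternating". Omni Guidance is not modeled here.

```python
from typing import Callable, List, Sequence, Tuple

# (context frame indices, target frame indices); both names are hypothetical.
IndexWindow = Tuple[Tuple[int, ...], Tuple[int, ...]]
Denoiser = Callable[[List[float]], List[float]]


def mean_denoise(window: List[float]) -> List[float]:
    """Toy stand-in for a denoiser: pull every frame to the window mean."""
    mean = sum(window) / len(window)
    return [mean for _ in window]


def stitch_step(
    noisy: Sequence[float], windows: Sequence[IndexWindow], denoise: Denoiser
) -> List[float]:
    """Denoise each window jointly, but write back only its target frames."""
    updated = list(noisy)
    for context_idx, target_idx in windows:
        target_set = set(target_idx)
        window_idx = sorted(set(context_idx) | target_set)
        # Every window reads the same pre-update sequence, so the windows
        # are independent and could run in parallel.
        denoised = denoise([noisy[i] for i in window_idx])
        for pos, i in enumerate(window_idx):
            if i in target_set:  # denoised context frames are discarded
                updated[i] = denoised[pos]
    return updated


def stitch(
    noisy: Sequence[float],
    temporal: Sequence[IndexWindow],
    spatial: Sequence[IndexWindow],
    denoise: Denoiser,
    num_steps: int,
) -> List[float]:
    """Alternate temporal and spatial windows across steps (assumed schedule)."""
    x = list(noisy)
    for step in range(num_steps):
        windows = temporal if step % 2 == 0 else spatial
        x = stitch_step(x, windows, denoise)
    return x


if __name__ == "__main__":
    # Three 2-frame chunks; the first and last are assumed spatially close.
    temporal = [((2, 3), (0, 1)), ((0, 1, 4, 5), (2, 3)), ((2, 3), (4, 5))]
    spatial = [((4, 5), (0, 1)), ((0, 1), (4, 5))]
    print(stitch([0, 0, 4, 4, 8, 8], temporal, spatial, mean_denoise, 2))
```

In this toy setting the temporal step only moves each chunk toward its immediate neighbors, while the spatial step is what brings the first and last chunks into agreement. That mirrors the point of Figure 4: a global receptive field alone does not close loops without an explicit mechanism.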