VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue

TL;DR

VideoCanvas addresses arbitrary spatio-temporal video completion, unifying diverse controllable generation tasks under a single framework. It adapts In-Context Conditioning (ICC) with a hybrid strategy, Spatial Zero-Padding plus Temporal RoPE Interpolation, to achieve pixel-frame-aware control with a frozen VAE and a fine-tuned DiT backbone, adding no new parameters or architectural changes. The paper formalizes the task, proposes the VideoCanvas pipeline, and presents VideoCanvasBench, a benchmark with intra-scene fidelity and inter-scene creativity tests. Experiments show that ICC with Temporal RoPE Interpolation delivers higher fidelity and dynamic consistency than competing conditioning paradigms across tasks such as AnyP2V, AnyI2V, and AnyV2V, while enabling applications such as long-duration extension and camera-like motion control.

Abstract

We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
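The spatial half of the hybrid strategy can be illustrated with a minimal sketch. It assumes a condition patch is a NumPy array in height-width-channel layout and that "zero-padding" means writing the patch into an otherwise all-zero frame at its target location; the function name `place_patch_on_canvas` and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np


def place_patch_on_canvas(patch, top, left, frame_height, frame_width):
    """Zero-pad a patch to full-frame size at a given location.

    Hypothetical helper for illustration only: `patch` has shape
    (h, w, channels), and (top, left) is where its top-left corner
    lands inside the full frame.
    """
    h, w, channels = patch.shape
    if top < 0 or left < 0 or top + h > frame_height or left + w > frame_width:
        raise ValueError("patch does not fit inside the frame")
    # Every pixel outside the patch stays zero.
    canvas = np.zeros((frame_height, frame_width, channels), dtype=patch.dtype)
    canvas[top:top + h, left:left + w] = patch
    return canvas


# Usage: a 2x3 patch of ones placed in a 4x6 frame.
patch = np.ones((2, 3, 3), dtype=np.float32)
canvas = place_patch_on_canvas(patch, top=1, left=2, frame_height=4, frame_width=6)
```

Each padded frame would then be passed to the VAE on its own, so spatial placement is carried by the pixel layout while temporal placement is left to the positional encoding.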

Paper Structure

This paper contains 45 sections, 4 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: VideoCanvas: Arbitrary Spatio-Temporal Video Completion. Given any conditions (frames or patches, outlined in red), the model fills in the remaining gray regions to generate coherent, high-quality videos. This unified formulation subsumes various tasks such as Any-Timestep-Patch/Image-to-Video, In/Outpainting, Camera Control, and Cross-scene Video Transitions, all in a zero-shot manner. More results are available on our project page: https://onevfall.github.io/project_page/videocanvas/. Best viewed zoomed in.
  • Figure 2: Core challenge and solution for pixel-frame-aware conditioning. (a) Causal VAEs create temporal ambiguity by mapping multiple pixel frames to a single latent. We propose a hybrid solution combining Spatial Padding with Temporal RoPE Interpolation. (b) We show how competing paradigms are ill-suited for fine-grained control, while our ICC approach provides an effective solution.
  • Figure 3: The pipeline of VideoCanvas, which fine-tunes a base T2V model for arbitrary spatio-temporal control with zero new parameters. Our framework leverages the In-Context Conditioning (ICC) paradigm. After preparing conditional patches with zero-padding for spatial placement, we use independent VAE encoding for temporal decoupling. Our RoPE Interpolation then aligns each discrete token by mapping its source pixel-frame index $Y$ to a fractional position $Y/N$, where $N$ is the VAE temporal stride (here, $N=4$). As illustrated, this maps Frame $41$ to position $10.25$. This strategy enables fine-grained control without architectural changes.
  • Figure 4: Impact of Temporal RoPE Interpolation. Per-frame PSNR for single-frame I2V with targets $2$/$3$/$4$. Our method (red, solid) peaks exactly at the target frame. "w/o RoPE Interpolation" (blue, dashed) misaligns, "Latent-space Condition" (orange, dot-dashed) collapses motion, and "Pixel-space Padding" (green, dotted) is precise but degraded.
  • Figure 5: Padding vs. RoPE Interp.
  • ...and 11 more figures
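
The index mapping in the Figure 3 caption (pixel-frame index $Y$ to fractional position $Y/N$) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard one-dimensional RoPE formulation, in which consecutive channel pairs are rotated by position-dependent angles, and the names `fractional_temporal_position`, `rope_angles`, and `apply_rope` are hypothetical. The only step specific to the caption is that the position passed to RoPE may be non-integer.

```python
import numpy as np


def fractional_temporal_position(frame_index, vae_stride=4):
    """Map a pixel-frame index Y to the fractional latent position Y / N."""
    return frame_index / vae_stride


def rope_angles(position, dim, base=10000.0):
    """Rotation angles of standard 1D RoPE; `position` may be fractional."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return position * inv_freq


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by the RoPE angles."""
    angles = rope_angles(position, vec.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = vec[0::2], vec[1::2]
    out = np.empty_like(vec, dtype=float)
    out[0::2] = even * cos - odd * sin
    out[1::2] = even * sin + odd * cos
    return out


# The caption's example: frame 41 with stride N = 4.
pos = fractional_temporal_position(41, vae_stride=4)  # 10.25
query = apply_rope(np.arange(8, dtype=float), pos)
```

Because RoPE attention scores depend only on the difference between positions, a condition token at position 10.25 sits a quarter of a latent step past latent index 10, which is how a fractional position can single out one pixel frame among those compressed into the same latent.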