
Adapting VACE for Real-Time Autoregressive Video Diffusion

Ryan Fosdick

TL;DR

This adaptation reuses existing pretrained VACE weights without additional training, and adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model.

Abstract

We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.
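To make the chunk-size argument concrete, here is a minimal bookkeeping sketch. The token and frame counts and both function names are illustrative assumptions, not values or identifiers from the paper or the reference implementation. It only shows why concatenating references into the latent sequence makes its length depend on the number of references, while a parallel conditioning pathway leaves it fixed.

```python
# Hypothetical sizes, chosen only for illustration.
TOKENS_PER_FRAME = 4
CHUNK_FRAMES = 3


def latent_len_concat(num_refs: int) -> int:
    """Batch-style layout: reference frames are concatenated into the
    diffusion latent sequence, so its length grows with num_refs."""
    return (num_refs + CHUNK_FRAMES) * TOKENS_PER_FRAME


def latent_len_parallel(num_refs: int) -> int:
    """Streaming-style layout: references travel through a separate
    conditioning pathway, so the latent sequence holds only the
    chunk's own frames regardless of num_refs."""
    return CHUNK_FRAMES * TOKENS_PER_FRAME


if __name__ == "__main__":
    for n in (0, 1, 2):
        print(n, latent_len_concat(n), latent_len_parallel(n))
```

A KV cache built for one chunk length can be reused across chunks only if every chunk produces the same number of latent tokens, which is the property the second layout keeps.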

Paper Structure (15 sections, 1 equation, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Original VACE concatenates references into the latent sequence, requiring post-hoc stripping. The streaming adaptation processes references through separate Context Blocks that inject hints into the DiT pathway, preserving fixed chunk sizes.
  • Figure 2: Per-chunk processing in the streaming VACE adaptation. Reference images are encoded once by Context Blocks; hints are injected into DiT blocks for each video chunk. The KV cache persists across chunks for autoregressive continuity.
  • Figure 3: Structural control modes. Each row: input frame, extracted conditioning signal, and generated output. Depth, scribble/edge, optical flow, and colorization (grayscale) controls shown.
  • Figure 4: Masked generation, layout control, and temporal extension. All outputs were generated in real time.
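
The per-chunk flow described in the Figure 2 caption can be sketched as a toy loop. Everything here is a stand-in under stated assumptions: `encode_references`, `dit_step`, `process_chunk`, and `StreamState` are hypothetical names, and plain floats replace tensors. The sketch only illustrates the control flow (references encoded once, hints injected into every chunk, cache persisting across chunks), not the actual model computation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StreamState:
    hints: Optional[List[float]] = None                  # computed once from references
    kv_cache: List[float] = field(default_factory=list)  # persists across chunks
    encode_calls: int = 0                                 # counts reference encodings


def encode_references(refs: List[List[float]]) -> List[float]:
    """Stand-in for the Context Blocks: one hint value per reference."""
    return [sum(r) / len(r) for r in refs]


def dit_step(chunk: List[float], hints: List[float], kv_cache: List[float]) -> List[float]:
    """Stand-in for the DiT blocks: add the hint signal and a summary of
    past chunks only (causal), keeping the output length equal to the input."""
    bias = sum(hints)
    past = sum(kv_cache)
    return [x + bias + past for x in chunk]


def process_chunk(state: StreamState, chunk: List[float], refs: List[List[float]]) -> List[float]:
    if state.hints is None:                        # encode references only on the first chunk
        state.hints = encode_references(refs)
        state.encode_calls += 1
    out = dit_step(chunk, state.hints, state.kv_cache)
    state.kv_cache.append(sum(out) / len(out))     # toy summary of this chunk for later chunks
    return out


if __name__ == "__main__":
    state = StreamState()
    refs = [[1.0, 3.0]]
    for chunk in ([0.0, 0.0], [0.0, 0.0]):
        print(process_chunk(state, chunk, refs))
```

In this toy version the second chunk's output differs from the first only through the cached summary, which mirrors how the cache, rather than a longer latent sequence, carries continuity between chunks.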