EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

Yehonathan Litman, Shikun Liu, Dario Seyb, Nicholas Milef, Yang Zhou, Carl Marshall, Shubham Tulsiani, Caleb Leak

TL;DR

EditCtrl addresses an inefficiency in state-of-the-art generative video editing: existing methods process the full video context regardless of the edit size. It introduces a disentangled framework with a local context encoder that operates only on masked tokens and a lightweight global context embedder that preserves temporal coherence, so that computation scales with the edit area $|V_m|$ rather than with the full video. The two adapters are integrated into a frozen base diffusion model, enabling multi-region editing and content propagation without fine-tuning the base model, and are trained with piecewise losses $L_\phi$ and $L_\psi$ to balance local fidelity and global coherence. Empirically, EditCtrl matches or surpasses full-attention baselines in quality while delivering substantial throughput gains and enabling real-time, high-resolution video editing and autoregressive content propagation.

Abstract

High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.
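
The claim that cost is proportional to the edit size can be illustrated with simple token counting. The sketch below is not taken from the paper: the nested-list mask layout and the function names `box_mask` and `token_budget` are assumptions made for illustration. It counts how many latent tokens fall inside an edit mask and reports that fraction, which is the factor by which per-token work shrinks if only masked tokens are processed. It also reports the ratio of token pairs, a general property of pairwise attention over a subset of tokens rather than a figure reported by the authors.

```python
def box_mask(frames, height, width, top, left, box_h, box_w):
    """Build a hypothetical binary mask [frames][rows][cols] with one rectangular edit region."""
    return [
        [
            [1 if top <= y < top + box_h and left <= x < left + box_w else 0
             for x in range(width)]
            for y in range(height)
        ]
        for _ in range(frames)
    ]


def token_budget(mask):
    """Count tokens inside the mask and compare with the full token grid.

    Assumes one token per mask cell; a real latent grid would come from a VAE
    encoder and a patchification step that are not modeled here.
    """
    edit_tokens = sum(cell for frame in mask for row in frame for cell in row)
    all_tokens = sum(len(row) for frame in mask for row in frame)
    fraction = edit_tokens / all_tokens
    return {
        "edit_tokens": edit_tokens,
        "all_tokens": all_tokens,
        # Work that is linear in the number of tokens shrinks by this factor.
        "token_fraction": fraction,
        # Pairwise interactions among edit tokens only, relative to all pairs.
        "pair_fraction": (edit_tokens ** 2) / (all_tokens ** 2),
    }


# Two frames of a 4x4 token grid with a 2x2 edit region in each frame.
example_budget = token_budget(box_mask(2, 4, 4, top=1, left=1, box_h=2, box_w=2))
```

With this toy mask, 8 of 32 tokens lie in the edit region, so a model that touches only those tokens does a quarter of the per-token work. The paper's measured speedups depend on the full pipeline, including the global context embedder, and are not reproduced by this count.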

Paper Structure (28 sections, 6 equations, 13 figures, 3 tables)

This paper contains 28 sections, 6 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: EditCtrl: A Real-time Generative Video Editing Pipeline. EditCtrl supports complex, prompt-guided edits on 4K videos, simultaneously handling an arbitrary number of user-defined masks (Top). To maintain real-time performance, our inference pipeline dynamically allocates compute proportional to the edit mask size (Middle). EditCtrl also intelligently propagates object edits from initial frames into the future (after the orange line), ensuring high temporal and object consistency in the resulting edit (Bottom).
  • Figure 2: EditCtrl Video Diffusion Framework Overview. EditCtrl edits a source video given a target edit mask. Foreground content is masked out, giving the background video that is also down-sampled to a constant resolution regardless of the original resolution. The compact global context of the down-sampled background video and the local context at the mask edit region are then encoded. These are given to trainable local and global adapters inside a pretrained text-to-video diffusion model that denoises tokens $\mathbf{z}^t$ only in the masked edit region given a text prompt. After diffusion, the tokens are scattered into the masked edit region in the encoded source video latents. Our method shows a proportional speedup with respect to the target mask area ratio.
  • Figure 3: EditCtrl: Local and Global Control Modules. Given the source video $\mathbf{V}_\text{src}$ and target edit masks $\mathbf{V}_m$, we extract the background content $\mathbf{V}_b$ and encode it with a video VAE encoder $\mathcal{E}$. This is then concatenated channel-wise with the down-sampled masks to give the control context $\mathbf{C}$. Tokens in $\mathbf{C}$ outside the down-sampled edit mask region are then masked out, giving the local context tokens $\mathbf{C}_\text{local}$ which go to the local encoder module $c_\phi$, whose outputs are added to selected transformer layers. The global embedder $G_\psi$ receives the query feature tokens and global context tokens produced from the down-sampled background content $\mathbf{V}_b^\downarrow$ and modulates the noisy cross-attended features.
  • Figure 4: Video Editing Comparison. EditCtrl generates visually appealing and structurally coherent edited content while the baselines either fail to edit the video correctly or produce content with poor appearance and blending. EditCtrl's localized editing greatly increases efficiency and enables real-time generative editing.
  • Figure 5: Video Inpainting Comparison. Even with full-attention, baseline methods struggle to inpaint content that is coherent and visually appealing, while our method successfully generates high fidelity content that aligns with the scene using much less compute.
  • ...and 8 more figures
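
The Figure 2 caption describes denoising tokens only in the masked edit region and then scattering them back into the encoded source latents. A minimal sketch of that gather-and-scatter bookkeeping follows, assuming a flattened token sequence and a stand-in denoiser. The names `gather_masked`, `scatter_back`, and `edit_locally` are hypothetical, tokens are plain floats rather than feature vectors, and the local and global adapters are not modeled, so this is an illustration of the data flow and not the authors' implementation.

```python
def gather_masked(latents, mask):
    """Return the indices and values of tokens whose mask entry is nonzero."""
    indices = [i for i, m in enumerate(mask) if m]
    return indices, [latents[i] for i in indices]


def scatter_back(latents, indices, new_tokens):
    """Write edited tokens into a copy of the source latents at the given indices."""
    out = list(latents)  # copy so the source latents stay untouched
    for i, token in zip(indices, new_tokens):
        out[i] = token
    return out


def edit_locally(latents, mask, denoise_fn):
    """Run denoise_fn on masked tokens only, then merge with the unmasked tokens.

    denoise_fn is a placeholder for the diffusion model; it receives a list of
    tokens and must return a list of the same length.
    """
    indices, local_tokens = gather_masked(latents, mask)
    edited = denoise_fn(local_tokens)
    if len(edited) != len(local_tokens):
        raise ValueError("denoise_fn must return one token per input token")
    return scatter_back(latents, indices, edited)


# Toy run: negate the two masked tokens and leave the background as it was.
source_latents = [1.0, 2.0, 3.0, 4.0]
edit_mask = [0, 1, 1, 0]
result = edit_locally(source_latents, edit_mask, lambda toks: [-t for t in toks])
```

The placeholder denoiser sees only two of the four tokens, which is the property that makes the cost follow the mask area. Tokens outside the mask are copied through unchanged, matching the caption's description of scattering into the source video latents.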