UNIC: Unified In-Context Video Editing

Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, Wenhan Luo

TL;DR

This work tackles the fragmentation of video editing tasks by proposing UNIC, a unified in-context framework that represents all inputs as a single sequence of tokens across three types: noisy tokens, reference video tokens, and multi-modal condition tokens. By processing this concatenated token stream with a diffusion-transformer backbone and introducing Condition Bias and Task-aware RoPE, UNIC achieves flexible, task-agnostic editing without task-specific adapters. A unified six-task benchmark demonstrates strong, across-the-board performance and emergent task composition capabilities, while ablations validate the necessity of the proposed conditioning mechanisms. The approach promises scalable multi-task video editing with efficient adaptation to new conditions and resolutions, signaling a shift toward more general-purpose video editing models.

Abstract

Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging: the varying video lengths and diverse condition modalities across tasks lead to severe token collisions and task confusion. To address these issues, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias, which enables the model to clearly differentiate between editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and varying condition tokens "in context", and to support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.

Paper Structure

This paper contains 38 sections, 4 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Unified In-Context Video Editing enables unified video editing and emergent task composition. Here we demonstrate the unification of six representative tasks, including ID Insert/Delete/Swap, Re-Camera Control, Stylization, and Propagation.
  • Figure 2: Architectural comparison for incorporating conditioning signals. (a) Extra Stage: Utilizes DDIM inversion on a reference video to derive inverted noise. (b) Extra One-to-One Control Modules: Employs dedicated, separate modules to process each control signal (e.g., reference video, multi-modal signals) and inject guidance into the diffusion model. (c) In-Context Video Editing (Ours): Our proposed method directly integrates guidance by tokenizing all conditioning signals (reference video, multi-modal signals) and concatenating them with the noisy input tokens, allowing the diffusion model to process all information jointly within its input sequence.
  • Figure 3: Overall Pipeline of Unified In-Context Video Editing. Our framework utilizes a unified transformer architecture for video editing. The model input is created by concatenating noisy tokens, reference video tokens, and multi-modal condition tokens (task-specific controls such as images) along the frame dimension, so that they form a single input sequence. By simply modifying the multi-modal condition tokens, this framework can handle any video editing task.
  • Figure 4: ID Pool and example of ID insert evaluation cases.
  • Figure 5: Example of ID swap evaluation cases.
  • ...and 7 more figures
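
The concatenation described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: `build_sequence`, the `segments` bookkeeping, and the toy frame values are hypothetical names and data chosen for illustration, and the sketch covers only the joining of the three token groups along the frame dimension. Task-aware RoPE and condition bias are not modeled here.

```python
from itertools import accumulate


def build_sequence(noisy, reference, condition):
    """Join three groups of frames into one sequence along the frame axis.

    Each argument is a list of frames; a frame is a list of token values.
    Returns the joined sequence and the (start, end) frame range of each
    group, so the noisy frames can be sliced back out after processing.
    Hypothetical helper for illustration only.
    """
    groups = {"noisy": noisy, "reference": reference, "condition": condition}
    # Frame-wise concatenation: noisy frames first, then reference, then condition.
    sequence = [frame for frames in groups.values() for frame in frames]
    # Running frame counts give the boundaries between groups.
    bounds = list(accumulate((len(frames) for frames in groups.values()), initial=0))
    segments = {name: (bounds[i], bounds[i + 1]) for i, name in enumerate(groups)}
    return sequence, segments


# Toy data (assumed sizes): 4 noisy frames, 4 reference frames, 1 condition frame.
noisy = [[0.0] * 3 for _ in range(4)]
reference = [[1.0] * 3 for _ in range(4)]
condition = [[2.0] * 3]

sequence, segments = build_sequence(noisy, reference, condition)
print(len(sequence), segments)
```

Changing the editing task in this picture only changes what is passed as `condition`; the joined sequence keeps the same layout, which is the property the caption attributes to the in-context formulation.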