ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma

TL;DR

ContextFlow tackles the challenge of training-free video object editing on Diffusion Transformers by combining high-fidelity RF-Solver inversion with Adaptive Context Enrichment, which softly fuses reconstruction and editing contexts in self-attention. A data-driven Vital Layer Analysis (Guidance Responsiveness) identifies the most influential DiT blocks for targeted guidance, enabling efficient, task-specific layer injection. Empirical results on object insertion, swapping, and deletion show strong improvements over training-free baselines and competitive performance with some training-based methods, with enhanced temporal coherence and background preservation. The work highlights a practical, scalable path for zero-shot video editing that adapts to DiT architectures and reduces artifacts associated with hard feature replacement.
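To illustrate why solver order matters for inversion, the sketch below compares a first-order Euler step with a generic second-order midpoint step on a round trip (data to noise, then back). It is a toy under stated assumptions: the scalar `toy_velocity` field, the step count, and all function names are hypothetical, and the midpoint rule stands in for a higher-order scheme in general. It is not RF-Solver's actual update, which this summary does not spell out.

```python
def euler_step(v, z, t, h):
    """First-order update: one velocity evaluation per step."""
    return z + h * v(z, t)


def midpoint_step(v, z, t, h):
    """Generic second-order update: re-evaluate the velocity at the half step."""
    z_mid = z + 0.5 * h * v(z, t)
    return z + h * v(z_mid, t + 0.5 * h)


def integrate(step, v, z, t_start, t_end, num_steps):
    """Integrate dz/dt = v(z, t) from t_start to t_end (h is negative when reversed)."""
    h = (t_end - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        z = step(v, z, t, h)
        t += h
    return z


def round_trip_error(step, v, z0, num_steps=10):
    """Invert z0 to a 'noise' latent, sample back, and measure the reconstruction gap."""
    z_T = integrate(step, v, z0, 0.0, 1.0, num_steps)     # inversion pass
    z_rec = integrate(step, v, z_T, 1.0, 0.0, num_steps)  # sampling pass
    return abs(z_rec - z0)


def toy_velocity(z, t):
    """Hypothetical scalar velocity field standing in for a learned model."""
    return -z
```

Calling `round_trip_error(euler_step, toy_velocity, 1.0)` and `round_trip_error(midpoint_step, toy_velocity, 1.0)` on this toy field shows the second-order step reconstructing the starting point far more closely at the same step count, which is the kind of inversion fidelity an editing pipeline depends on.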

Abstract

Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more pronounced in Diffusion Transformers (DiTs), where prior layer-selection heuristics are unsuitable, making effective guidance difficult. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. Specifically, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to dynamically fuse information. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
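The contrast between hard replacement and Key-Value concatenation can be sketched in a few lines. This is a minimal single-head illustration in plain Python, assuming tokens are small lists of floats; the function names (`enriched_attention`, `hard_replacement_attention`) and the toy tensors are hypothetical, and details such as multi-head layout, positional encoding, and how a real DiT block exposes its keys and values are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    """Editing-path queries attend over both contexts, concatenated along the token axis."""
    return attention(q_edit, k_edit + k_recon, v_edit + v_recon)


def hard_replacement_attention(q_edit, k_recon, v_recon):
    """Baseline: the editing context is discarded in favour of the reconstruction path."""
    return attention(q_edit, k_recon, v_recon)
```

With a query that matches an editing-path key, for example `q_edit = [[4.0, 0.0]]`, `k_edit = [[4.0, 0.0]]`, `v_edit = [[1.0, 0.0]]`, `k_recon = [[0.0, 4.0]]`, `v_recon = [[0.0, 1.0]]`, the enriched variant returns a value dominated by the editing path, while the hard-replacement variant can only return the reconstruction value. The softmax decides the mix per query, which is the sense in which the fusion is adaptive rather than imposed.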

Paper Structure

This paper contains 45 sections, 6 equations, 23 figures, 5 tables.

Figures (23)

  • Figure 1: Showcase of ContextFlow. ContextFlow achieves versatile, high-fidelity video object editing without any training. It performs strongly on a range of challenging object-level tasks, including object insertion (1st row), swapping (2nd row), and deletion (3rd row). The core design of our approach is Adaptive Context Enrichment, which allows for seamless integration of new elements with realistic interactions and meticulous preservation of the original scenes.
  • Figure 2: Motivation for ContextFlow. We highlight two core failures of prior methods: DDIM inversion yields poor reconstructions, while "hard replacement" misaligns attention so that it focuses only on the original background. ContextFlow addresses both.
  • Figure 3: Overview of ContextFlow. Our method begins with a high-fidelity video inversion using RF-Solver to obtain a shared noise latent $\mathbf{z}_T$. A dual-path sampling process then decouples reconstruction and editing. The editing path is guided by our core mechanism, Adaptive Context Enrichment, where Key-Value pairs from the reconstruction path are concatenated into the self-attention blocks of the editing path. This guidance is targeted to vital layers, identified via our Guidance Responsiveness analysis, and is active only during the first half of the denoising process to balance fidelity and consistency.
  • Figure 4: Resolving Contextual Conflict. Hard replacement misdirects attention for edited queries, suppressing object synthesis. Our Adaptive Context Enrichment resolves this by offering a dual context: the Editing Path for synthesizing the new object, and the Reconstruction Path for preserving background structure. Attention in unedited regions remains correct, confirming our method is non-invasive.
  • Figure 5: Task-Dependent Guidance Responsiveness (values in the figure are min-max normalized). A higher Guidance Responsiveness indicates greater influence. Three primary zones emerge across the layers: a shallow zone (layers 1-10), a middle zone (layers 15-21), and a deep zone (layers 26-32). The ranking of Guidance Responsiveness among these three zones varies with the task.
  • ...and 18 more figures
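The captions for Figures 3 and 5 describe three mechanical steps: min-max normalizing per-layer Guidance Responsiveness, targeting the most responsive layers, and keeping guidance active only during the first half of denoising. A minimal sketch of that bookkeeping follows. The scores, the `top_k` cutoff, and the function names are hypothetical; how the responsiveness metric itself is computed is not given in this summary and is not reproduced here. Layer indices are 0-based in the code, whereas the figure counts layers from 1.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_vital_layers(scores, top_k):
    """Return the indices of the top_k most responsive layers, in layer order."""
    normalized = min_max_normalize(scores)
    ranked = sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
    return sorted(ranked[:top_k])


def guidance_active(layer, step, num_steps, vital_layers):
    """Guidance applies only at vital layers and only in the first half of the steps."""
    return layer in vital_layers and step < num_steps // 2
```

For example, with hypothetical per-layer scores `[0.2, 0.9, 0.4, 0.7, 0.1, 0.8]` and `top_k=3`, `select_vital_layers` picks layers 1, 3, and 5; with 10 denoising steps, `guidance_active` is true for those layers at steps 0 through 4 and false afterwards. Because the ranking is task-dependent, a separate score list would be needed per task (insertion, swapping, deletion).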