ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma
TL;DR
ContextFlow performs training-free video object editing on Diffusion Transformers by combining high-fidelity RF-Solver inversion with Adaptive Context Enrichment, which softly fuses reconstruction and editing contexts in self-attention. A data-driven Vital Layer Analysis, based on a Guidance Responsiveness metric, identifies the most influential DiT blocks, enabling efficient, task-specific layer injection. Empirical results on object insertion, swapping, and deletion show strong improvements over training-free baselines and competitive performance with some training-based methods, with enhanced temporal coherence and background preservation. The work highlights a practical, scalable path for zero-shot video editing that adapts to DiT architectures and reduces the artifacts associated with hard feature replacement.
Abstract
Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion, but it struggles to maintain fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more severe in Diffusion Transformers (DiTs), where prior layer-selection heuristics do not transfer, making effective guidance difficult. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. We first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment, a mechanism that specifies what to edit while resolving contextual conflicts: instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, letting the model dynamically fuse information. To determine where to apply this enrichment, we propose a systematic, data-driven analysis that identifies task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
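To make the Key-Value concatenation concrete, the following is a minimal single-head sketch in plain Python, not the authors' implementation. It assumes that queries come from the editing path and that the keys and values of the editing and reconstruction paths are joined along the token axis before the softmax. The function names (`attention`, `enriched_attention`) and the toy vectors are hypothetical, and multi-head projections, masking, and the per-layer selection described in the abstract are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


def enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    """Assumed form of context enrichment: editing-path queries attend over
    the concatenation of editing and reconstruction keys/values, so the
    softmax decides how much each path contributes per query token."""
    return attention(q_edit, k_edit + k_recon, v_edit + v_recon)


# Toy example: one query token, one token from each path (hypothetical data).
q_edit = [[1.0, 0.0]]
k_edit, v_edit = [[1.0, 0.0]], [[1.0, 0.0]]
k_recon, v_recon = [[0.0, 1.0]], [[0.0, 1.0]]
out = enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon)
```

In this toy case the output is a convex combination of the two value vectors, weighted toward the editing token because its key aligns with the query. That is the sense in which the fusion is "soft": neither path's features are overwritten, in contrast to hard feature replacement.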
