Unified Video Editing with Temporal Reasoner

Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu

TL;DR

VideoCoF introduces a Chain-of-Frames framework that enforces seeing, reasoning, then editing for unified video editing without masks. By predicting edit-region latents in a dedicated reasoning step and applying a RoPE-based alignment, it achieves precise instruction-to-region mapping and length extrapolation beyond the training duration. Trained on a compact dataset of 50k video pairs, VideoCoF sets new state-of-the-art results on VideoCoF-Bench with strong instance-level editing capabilities and efficient data usage. The approach opens avenues for broader task generalization, longer sequence handling, and potential integration with image editing data for cross-domain transfer.

Abstract

Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.
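
To make the "see, reason, then edit" ordering concrete, the sketch below lays out one sequence as source frames, then a reasoning frame, then target frames, and attaches a temporal position to each. The index scheme is an assumption for illustration only: the abstract does not specify it. Here source and target frames share the same temporal positions (one way to express motion alignment), and the reasoning frame takes a position no video frame uses. The names `FrameToken` and `build_chain_of_frames` are hypothetical and do not come from the VideoCoF code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FrameToken:
    role: str          # "source", "reasoning", or "target"
    temporal_pos: int  # temporal index that would be passed to the rotary embedding


def build_chain_of_frames(num_frames: int) -> List[FrameToken]:
    """Order frame latents as source -> reasoning -> target.

    ASSUMED index scheme (not taken from the paper): source and target
    frames share positions 1..num_frames, and the single reasoning frame
    sits at position 0 so it never collides with a video frame.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    source = [FrameToken("source", t) for t in range(1, num_frames + 1)]
    reasoning = [FrameToken("reasoning", 0)]
    target = [FrameToken("target", t) for t in range(1, num_frames + 1)]
    # The reasoning frame comes before the target frames, so the edit
    # region is predicted before the edited video is generated.
    return source + reasoning + target


layout = build_chain_of_frames(3)
print([(f.role, f.temporal_pos) for f in layout])
```

Under this assumed scheme, a longer clip at inference time only extends the shared range of frame positions, and the reasoning frame's position stays fixed. Whether this matches the actual RoPE design should be checked against the released code.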

Paper Structure

This paper contains 23 sections, 2 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of the difference between previous methods and our VideoCoF. We enhance editing accuracy by forcing the video diffusion model to first predict the editing area and then perform the edit.
  • Figure 2: Overview of the VideoCoF framework. Our model processes source (blue), reasoning (orange), and target (green) tokens in a unified sequence to "reason" then "edit". Bottom right: Our RoPE design enables length extrapolation.
  • Figure 3: How our RoPE design avoids index collision.
  • Figure 4: Our data curation pipeline for multi-instance data.
  • Figure 5: Visual comparison between our VideoCoF and other methods on diverse video editing tasks.
  • ...and 4 more figures