Region-Constraint In-Context Generation for Instructional Video Editing

Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei

TL;DR

This work tackles instructional video editing from a data-efficient, region-aware perspective. It introduces ReCo, which uses region-constrained in-context generation with latent-space and attention-space regularizations to localize edits and suppress interference from non-editing regions, and is trained on the large-scale ReCo-Data dataset. The approach combines joint denoising of width-wise concatenated source and target videos, a video condition branch, and flow-matching objectives, achieving superior edit accuracy, naturalness, and visual quality across four editing tasks. The authors also present ReCo-Data (500K high-quality instruction-video pairs) and a VLLM-based evaluation benchmark, enabling robust evaluation of instruction-based video editing models and demonstrating strong generalization to creative edits.

Abstract

The in-context generation paradigm has recently demonstrated strong power in instructional image editing, in terms of both data efficiency and synthesis quality. Nevertheless, adapting such in-context learning to instruction-based video editing is not trivial. Without specified editing regions, the results can suffer from inaccurate editing regions and from token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that models constraints between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo employs two regularization terms, latent regularization and attention regularization, applied to the one-step backward denoised latents and the attention maps, respectively. The former increases the latent discrepancy of the editing region between the source and target videos while reducing that of the non-editing areas, which emphasizes the modification in the editing area and alleviates unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to the tokens in the counterpart region of the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
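The exact loss definitions are not given in this section, so the following is only a minimal NumPy sketch of the two ideas named in the abstract: width-wise concatenation and a region-aware latent term. The function names (`concat_width`, `latent_region_loss`), the `(T, H, W, C)` latent layout, the binary `edit_mask`, and the "outside minus inside" form of the loss are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def concat_width(source, target):
    """Concatenate two latent videos of shape (T, H, W, C) along the width axis."""
    return np.concatenate([source, target], axis=2)


def latent_region_loss(source, target_pred, edit_mask):
    """Hypothetical region-aware latent term (an assumption, not the paper's loss).

    source, target_pred: latents of shape (T, H, W, C).
    edit_mask: array of shape (T, H, W), 1 inside the editing region, 0 outside.
    Minimising the returned value lowers the source/target discrepancy outside
    the editing region and raises it inside the editing region.
    """
    # Squared discrepancy per spatio-temporal position, averaged over channels.
    diff = np.mean((target_pred - source) ** 2, axis=-1)
    mask = edit_mask.astype(diff.dtype)
    eps = 1e-8  # guards against an empty region
    inside = (diff * mask).sum() / (mask.sum() + eps)
    outside = (diff * (1.0 - mask)).sum() / ((1.0 - mask).sum() + eps)
    return outside - inside


# Toy example: 2 frames of 4x4 latents with 3 channels.
source = np.zeros((2, 4, 4, 3))
edit_mask = np.zeros((2, 4, 4))
edit_mask[:, :2, :2] = 1.0

# Prediction that changes only the editing region.
localized = source.copy()
localized[:, :2, :2, :] = 1.0

# Prediction that changes only content outside the editing region.
leaky = source.copy()
leaky[:, 2:, 2:, :] = 1.0

joint = concat_width(source, localized)
loss_localized = latent_region_loss(source, localized, edit_mask)
loss_leaky = latent_region_loss(source, leaky, edit_mask)
```

In this toy setting the localized prediction receives a lower value than the leaky one, which matches the stated goal of emphasizing changes in the editing area while discouraging changes elsewhere. A real training objective would combine such a term with the diffusion loss and would likely bound or weight the "inside" part.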

Paper Structure

This paper contains 22 sections, 19 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Our ReCo enables video editing based solely on textual instructions, achieving precise and high-fidelity video content modification. ReCo can adeptly handle diverse and challenging video editing tasks, including both local object editing and global style transfer.
  • Figure 2: An overview of our ReCo framework. We reformulate the instructional video editing task as an in-context generation paradigm, guided by the source video and instruction prompt. The source video is treated as an explicit condition by feeding it into an auxiliary video condition branch. To emphasize editing modifications and alleviate token interference between editing and non-editing areas, ReCo introduces two region-based constraints: (1) Latent-space regularization, which increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas. (2) Attention-space regularization, which suppresses the attention of the target editing region towards the corresponding region in the source video, thereby mitigating inherent token interference, while simultaneously strengthening its attention on its own generated content.
  • Figure 3: Comparison between existing video editing datasets and our ReCo-Data. Ours features the most balanced data distribution and a higher ratio of high-quality samples.
  • Figure 4: Examples of video editing (i.e., add object, replace object and style transfer) results by different approaches.
  • Figure 5: Visual comparisons on the object removal task.
  • ...and 4 more figures
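
The attention-space regularization described in the Figure 2 caption can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the token layout (source tokens first, then target tokens), the single-head scaled dot-product attention, the function name `attention_region_loss`, and the "attention to source region minus attention to own region" form. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_region_loss(q, k, edit_mask_tokens):
    """Hypothetical attention-space term (an assumption, not the paper's loss).

    q, k: arrays of shape (2N, d); the first N rows are source-video tokens and
    the last N rows are target-video tokens at the same positions.
    edit_mask_tokens: boolean array of shape (N,), True for editing-region
    tokens; it must contain at least one True entry.
    Minimising the returned value reduces the attention that target
    editing-region queries pay to the counterpart source tokens and increases
    the attention they pay to the target editing region itself.
    """
    n = edit_mask_tokens.shape[0]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (2N, 2N), rows sum to 1
    no_tokens = np.zeros(n, dtype=bool)
    target_edit = np.concatenate([no_tokens, edit_mask_tokens])
    source_edit = np.concatenate([edit_mask_tokens, no_tokens])
    rows = attn[target_edit]  # queries from the target editing region
    to_source = rows[:, source_edit].sum(axis=1).mean()
    to_self = rows[:, target_edit].sum(axis=1).mean()
    return to_source - to_self


# Toy example: 3 positions per video, 4-dimensional tokens, position 0 is edited.
mask = np.array([True, False, False])
q = np.zeros((6, 4))
q[3, 0] = 10.0  # query of the target editing-region token

k_self = np.zeros((6, 4))
k_self[3, 0] = 10.0  # key of the same target token matches the query

k_source = np.zeros((6, 4))
k_source[0, 0] = 10.0  # key of the counterpart source token matches the query

loss_self = attention_region_loss(q, k_self, mask)
loss_source = attention_region_loss(q, k_source, mask)
loss_uniform = attention_region_loss(np.zeros((6, 4)), np.zeros((6, 4)), mask)
```

When the target editing-region query attends almost entirely to its own region the value approaches -1, and when it attends almost entirely to the counterpart source token the value approaches +1, so minimising it pushes attention in the direction the caption describes.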