InstructVEdit: A Holistic Approach for Instructional Video Editing

Chi Zhang, Chengjian Feng, Feng Yan, Qiming Zhang, Mingjin Zhang, Yujie Zhong, Jing Zhang, Lin Ma

TL;DR

Instruction-guided video editing has been hampered by the scarcity of high-quality paired data. InstructVEdit addresses this with a dataset curation workflow that leverages image editing data and two model innovations, the Soft Motion Adapter (SMA) and the Editing-guided Propagation Module (EPM), which improve edit fidelity while maintaining temporal coherence. A multi-round iterative refinement strategy incorporates real-world video data to bridge the synthetic-to-real domain gap, yielding state-of-the-art results on the TGVE and TGVE+ benchmarks and favorable user-preference results. The approach offers a scalable, practical pipeline for instruction-based video editing that reduces data requirements and improves generalization to real-world scenarios.

Abstract

Video editing according to instructions is a highly challenging task due to the difficulty in collecting large-scale, high-quality edited video pair data. This scarcity not only limits the availability of training data but also hinders the systematic exploration of model architectures and training strategies. While prior work has improved specific aspects of video editing (e.g., synthesizing a video dataset using image editing techniques or decomposed video editing training), a holistic framework addressing the above challenges remains underexplored. In this study, we introduce InstructVEdit, a full-cycle instructional video editing approach that: (1) establishes a reliable dataset curation workflow to initialize training, (2) incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency, and (3) proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies. Extensive experiments show that InstructVEdit achieves state-of-the-art performance in instruction-based video editing, demonstrating robust adaptability to diverse real-world scenarios. Project page: https://o937-blip.github.io/InstructVEdit.

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: We compare our data curation approach with the existing method that directly utilizes P2P to generate video editing pairs.
  • Figure 2: The data curation workflow. First, we train an image editing model primarily on a realistic-style dataset, then perform first-frame editing and filter out successful sampling trajectories. To construct video editing data pairs, we extend source images into source videos using an image-to-video model and generate the corresponding edited target videos with our proposed First-Frame-Guided Video-to-Video model. In our workflow, the training process is enclosed in a red frame, while the inference processes are enclosed in blue frames.
  • Figure 3: Overview of the InstructVEdit model. We introduce two structural innovations: the Soft Motion Adapter (SMA) and the Editing-guided Propagation Module (EPM), which enhance the editing capabilities of the pre-trained image editing model while maintaining temporal consistency. The bottom section of the figure clarifies the information flow from EPM to self-attention.
  • Figure 4: Visualization of editing results from different models.
  • Figure 5: Visualization of editing results with and without EPM. To assess the impact of EPM, we perform inference on the source video (first row) while masking EPM's output. The importance sequence $s$ is computed as the column-wise average of $S$ from all EPM layers at the final denoising step, where each value represents the relative significance of a frame. The second row shows the resulting output and its heatmap, with red indicating higher importance and dark colors indicating lower importance. The third row illustrates the output with EPM enabled, highlighting frames with stronger edit effects and effectively propagating them.
  • ...and 1 more figure
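
The Figure 5 caption describes the importance sequence $s$ as the column-wise average of $S$ over all EPM layers. A minimal sketch of that averaging follows, under stated assumptions: each layer's $S$ is taken to be a rows-by-frames matrix of scores, every layer has the same shape, and layers are weighted equally. The function names and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]  # assumed layout: rows x frames


def column_mean(matrix: Matrix) -> List[float]:
    """Average each column of one score matrix S (one value per frame)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]


def importance_sequence(layer_scores: List[Matrix]) -> List[float]:
    """Average the per-layer column means into one sequence s.

    Assumption: all layers share the same shape and are weighted equally.
    """
    per_layer = [column_mean(S) for S in layer_scores]
    n_layers = len(per_layer)
    n_frames = len(per_layer[0])
    return [sum(v[f] for v in per_layer) / n_layers for f in range(n_frames)]


# Toy input: two hypothetical EPM layers, three frames each.
S_layer1 = [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]]
S_layer2 = [[0.0, 1.0, 0.0], [0.2, 0.6, 0.2]]

s = importance_sequence([S_layer1, S_layer2])
most_important_frame = max(range(len(s)), key=lambda f: s[f])
```

In this toy input the middle frame receives the largest value, so it would appear as the reddest entry in a heatmap like the one described in the caption.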