Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin; Guoqiang Liang; Ziyun Zeng; Zechen Bai; Yanzhe Chen; Mike Zheng Shou

TL;DR

The paper introduces a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, using image generative models to create synthesized reference scaffolds. It also proposes Kiwi-Edit, a unified editing architecture that combines learnable queries and latent visual features for reference semantic guidance.

Abstract

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.

Paper Structure (22 sections, 2 equations, 12 figures, 5 tables)

Introduction
Related Work
Instruction-based Video Editing
Reference-Guided Video Editing and Dataset
RefVIE Dataset and Benchmark
Scalable Data Generation Pipeline
Dataset Statistics
Benchmark and Evaluation
Methodology
Architecture Design
Training Curriculum
Experiments
Implementation Details
Main Results
Qualitative Results
...and 7 more sections

Figures (12)

Figure 1: This teaser illustrates a selection of video editing tasks, including both instruction-only and instruction-reference scenarios, highlighting the superior editing capabilities of RefVIE.
Figure 2: Workflow of the reference image synthesis pipeline. We first ground the editing region in the target video frame using specialized grounding and segmentation models. Subsequently, we leverage a specialized image editing model to synthesize a high-quality reference image that maintains identity consistency with the instruction.
Figure 3: Pipeline of RefVIE curation. We process 3.7M raw samples through four stages: source aggregation and filtering, grounding and segmentation, reference image synthesis, and quality control, yielding 477K high-quality quadruplets.
Figure 4: RefVIE statistics and sample visualization. (a) Distribution of editing task types. (b) Distribution of video durations. (c) Example reference images for different editing categories.
Figure 5: Overview of our unified editing framework. We integrate a frozen MLLM (Qwen2.5-VL-3B) to encode multimodal instructions, injecting semantic conditions into the pre-trained Diffusion Transformer (Wan2.2-TI2V-5B) via dual learnable projectors for query and reference latents. To preserve consistency with the source video, we employ a hybrid injection strategy within the DiT: source video features are added element-wise, while reference image features are concatenated to the input sequence.
...and 7 more figures
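
The hybrid injection strategy described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `hybrid_inject`, the token shapes, and the use of plain NumPy arrays in place of DiT latents are all assumptions made for illustration. It only shows the two operations named in the caption, an element-wise addition for source video features and a sequence-wise concatenation for reference image features.

```python
import numpy as np


def hybrid_inject(noisy_tokens, source_tokens, reference_tokens):
    """Hypothetical helper: combine conditions as the Figure 5 caption describes.

    noisy_tokens:     (N, D) tokens of the video being denoised (assumed layout)
    source_tokens:    (N, D) source video features, aligned token-for-token
    reference_tokens: (M, D) reference image features
    """
    if noisy_tokens.shape != source_tokens.shape:
        raise ValueError("source features must match the noisy token shape")
    if reference_tokens.shape[-1] != noisy_tokens.shape[-1]:
        raise ValueError("reference features must share the channel width D")

    # Source video: element-wise addition keeps the sequence length at N.
    fused = noisy_tokens + source_tokens

    # Reference image: appended along the sequence axis, giving N + M tokens.
    return np.concatenate([fused, reference_tokens], axis=0)


# Toy shapes chosen only for the example.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((16, 8))
source = rng.standard_normal((16, 8))
reference = rng.standard_normal((4, 8))

sequence = hybrid_inject(noisy, source, reference)
print(sequence.shape)
```

In this sketch the addition requires the source features to be aligned position by position with the tokens being denoised, while the concatenation places no such constraint on the reference tokens, which only need the same channel width.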
