EasyV2V: A High-quality Instruction-based Video Editing Framework

Jinjie Mai; Chaoyang Wang; Guocheng Gordon Qian; Willi Menapace; Sergey Tulyakov; Bernard Ghanem; Peter Wonka; Ashkan Mirzaei

TL;DR

EasyV2V is a lightweight, instruction-based video editor that achieves high-fidelity edits with flexible inputs by jointly designing data, architecture, and control. It combines a diverse data pipeline (lifting image-to-image (I2I) edits to video-to-video (V2V) pairs, composing off-the-shelf experts, and mining dense-captioned text-to-video (T2V) data) with a minimal-tuning backbone that uses sequence-based conditioning and LoRA, plus a single spatiotemporal mask for unified control. The approach yields state-of-the-art results on EditVerseBench, outperforms baselines both with and without reference images, and demonstrates robust action and transition edits across broad categories. The work also provides extensive ablations, classifier-free guidance (CFG) analyses, and user studies that support its practical applicability, while noting inference-time limitations and potential extensions to richer cinematic controls.

Abstract

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
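
The idea of lifting an image edit pair into a pseudo video pair with shared affine motion can be illustrated with a small sketch. This is not the authors' pipeline: the function names (`affine_warp`, `affine_trajectory`, `lift_pair_to_video`), the zoom-and-pan trajectory, and the parameters `max_shift` and `max_zoom` are all assumptions chosen for illustration, and nearest-neighbor sampling is used only to keep the example short. The point it shows is that applying the same per-frame affine transform to the source image and the edited image gives two clips whose motion matches frame by frame, so the only difference between them is the edit.

```python
import numpy as np


def affine_warp(image, matrix):
    """Warp an image with a 2x3 affine matrix mapping output pixel
    coordinates (x, y) to input coordinates; nearest-neighbor sampling,
    out-of-range coordinates are clamped to the border."""
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x N
    src = matrix @ coords  # 2 x N
    sx = np.clip(np.rint(src[0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(src[1]).astype(int), 0, h - 1)
    return image[sy, sx].reshape(image.shape)


def affine_trajectory(num_frames, height, width, max_shift=8.0, max_zoom=0.1):
    """Hypothetical camera motion: a linear zoom-in about the image center
    combined with a horizontal pan. Frame 0 is the identity transform."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    matrices = []
    for t in np.linspace(0.0, 1.0, num_frames):
        s = 1.0 - max_zoom * t  # sampling scale below 1 zooms in
        tx = max_shift * t      # pan grows linearly over time
        matrices.append(np.array([[s, 0.0, cx - s * cx + tx],
                                  [0.0, s, cy - s * cy]]))
    return matrices


def lift_pair_to_video(src_image, edited_image, num_frames=8):
    """Apply one shared affine trajectory to both images of an edit pair."""
    h, w = src_image.shape[:2]
    matrices = affine_trajectory(num_frames, h, w)
    src_video = np.stack([affine_warp(src_image, m) for m in matrices])
    edit_video = np.stack([affine_warp(edited_image, m) for m in matrices])
    return src_video, edit_video
```

Calling `lift_pair_to_video(src, edited)` on two `H x W x 3` arrays returns two arrays of shape `(num_frames, H, W, 3)`. Because both clips share the same transforms, any pixel-wise relation between the two input images carries over to every frame of the resulting pair.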
