Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists

Bojia Zi, Penghui Ruan, Marco Chen, Xianbiao Qi, Shaozhe Hao, Shihao Zhao, Youze Huang, Bin Liang, Rong Xiao, Kam-Fai Wong

TL;DR

The paper introduces Señorita-2M, a large-scale (≈2M) instruction-based video editing dataset built from four specialized video editors to close data quality gaps in end-to-end video editing. It presents a comprehensive methodology, including the construction of global/local editors, a diverse data source from Pexels, and a robust, multi-stage data filtering pipeline guided by LLM-generated instructions. Extensive experiments demonstrate superior text–video alignment, temporal consistency, and editing fidelity across multiple architectures, with ablations validating the impact of dataset size and model variations. The work advances practical, high-quality video editing via supervised, instruction-driven training and provides open-source resources to the community.

Abstract

Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still suffer from several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. On the other hand, end-to-end methods, which rely on edited video pairs for training, offer faster inference speeds but often produce poor editing results due to a lack of high-quality training video pairs. In this paper, to close the gap in end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset. Señorita-2M consists of approximately 2 million video editing pairs. It is built using four high-quality, specialized video editing models, each designed and trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure based on current pre-trained generative models. Extensive experiments show that our dataset helps yield remarkably high-quality video editing results. More details are available at https://senorita-2m-dataset.github.io.

Paper Structure

This paper contains 46 sections, 19 figures, 8 tables.

Figures (19)

  • Figure 1: The visual results given by editing models trained on our Señorita-2M. Best viewed with Acrobat Reader. Click the images to play the animation clips.
  • Figure 2: Top: The data construction pipeline of the Señorita-2M dataset. Bottom: The filtering pipeline of Señorita-2M. Further details are provided in the Appendix.
  • Figure 3: Visualization of our Señorita-2M. Best viewed with Acrobat Reader. Click the images to play the animation clips.
  • Figure 4: Overview of Señorita-2M dataset: a statistical analysis.
  • Figure 6: The construction pipeline of the annotated training dataset used to train the experts.
  • ...and 14 more figures