IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment

Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, Shuicheng Yan

TL;DR

IVEBench tackles the lack of robust evaluation for instruction-guided video editing (IVE) by introducing a large, diverse benchmark of 600 source videos spanning seven semantic dimensions, with edit prompts covering 8 editing task categories and 35 subcategories. It defines a three-dimensional evaluation protocol (Video Quality, Instruction Compliance, and Video Fidelity) that combines traditional metrics with multimodal LLM-based assessments. The work demonstrates that current IVE methods struggle with broad task coverage and per-frame fidelity, while showing strong alignment between automatic metrics and human judgments. By open-sourcing the dataset, prompts, and scoring framework, IVEBench aims to standardize and accelerate progress in the field.

Abstract

Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to support the evaluation of instruction-guided video editing adequately and further suffer from limited source diversity, narrow task coverage and incomplete evaluation metrics. To address the above limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos, spanning seven semantic dimensions, and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.

Paper Structure

This paper contains 29 sections, 12 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of our proposed IVEBench. 1) We construct a diverse video corpus consisting of 600 high-quality source videos systematically organized across 7 semantic dimensions. 2) For source videos, we design carefully crafted edit prompts, covering 8 major editing task categories with 35 subcategories. 3) We establish a comprehensive three-dimensional evaluation protocol comprising 12 metrics, enabling human-aligned benchmarking of state-of-the-art IVE methods.
  • Figure 2: The data acquisition and processing pipeline of IVEBench includes: 1) A curation process that yields 600 high-quality, diverse videos. 2) A well-designed pipeline for generating comprehensive editing prompts.
  • Figure 3: Statistical distributions of IVEBench.
  • Figure 4: IVEBench Evaluation Results of Video Editing Models. We visualize the evaluation results of four IVE models on the 12 IVEBench metrics. We normalize the results per dimension for clearer comparisons. For comprehensive numerical results, please refer to the performance comparison table.
  • Figure 5: Qualitative comparison of state-of-the-art IVE methods.
  • ...and 2 more figures
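
The Figure 4 caption says results are normalized per dimension before plotting but does not state the method. As a minimal sketch, assuming a min-max rescaling of each metric across models (one plausible reading, not necessarily the authors' procedure), it could look like the following. The function name, model names, and score values are hypothetical placeholders.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Rescale each metric column to [0, 1] across models.

    `scores` maps a (hypothetical) model name to its raw metric values,
    one value per metric, in the same metric order for every model.
    """
    models = list(scores)
    n_metrics = len(scores[models[0]])
    normalized = {m: [0.0] * n_metrics for m in models}
    for j in range(n_metrics):
        column = [scores[m][j] for m in models]
        lo, hi = min(column), max(column)
        span = hi - lo
        for m in models:
            # If all models tie on a metric, place them at the midpoint.
            normalized[m][j] = 0.5 if span == 0 else (scores[m][j] - lo) / span
    return normalized


# Placeholder values for illustration only; not results from the paper.
toy_scores = {
    "model_a": [0.2, 10.0, 3.0],
    "model_b": [0.4, 30.0, 3.0],
    "model_c": [0.8, 20.0, 3.0],
}
normalized_scores = min_max_normalize(toy_scores)
```

Rescaling this way puts metrics with very different raw ranges on a shared axis, which is what makes a multi-metric chart readable. It also hides absolute differences, so the numerical table remains the reference for actual scores.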