VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment

Shangkun Sun; Xiaoyu Liang; Songlin Fan; Wenxu Gao; Wei Gao

TL;DR

VE-Bench addresses the lack of reliable quantitative metrics for text-driven video editing with two components: VE-Bench DB, the first subjective-aligned VQA dataset for edited videos, and VE-Bench QA, a multi-modal metric that scores text-video alignment, source-target relevance, and visual quality. VE-Bench QA combines a BLIP-based temporal alignment module, a Uniformer-based source-target backbone, and specialized aesthetics and distortion branches, and is trained with PLCC and rank losses. On VE-Bench DB it aligns with human MOS significantly better than prior metrics (e.g., DOVER and CLIP-based scores), and it also performs strongly on T2VQA-DB, which supports its broader use for evaluating AIGC video editing. The suite, with its data and code, gives the community a resource for human-aligned evaluation of text-driven video editing.
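The summary above names PLCC and rank losses as the training objectives but does not give their form. The sketch below shows one common way such losses are written in VQA work, in plain Python so the arithmetic is easy to follow. The function names, the `(1 - PLCC) / 2` scaling, the pairwise hinge form, the `margin` parameter, and the `rank_weight` factor are all assumptions made for illustration, not the authors' implementation.

```python
import math


def plcc(pred, target):
    """Pearson linear correlation between predicted scores and MOS."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    # A constant input has no defined correlation; treat it as 0 here.
    return cov / denom if denom > 0 else 0.0


def plcc_loss(pred, target):
    """Assumed form: maps a correlation in [-1, 1] to a loss in [0, 1]."""
    return (1.0 - plcc(pred, target)) / 2.0


def rank_loss(pred, target, margin=0.0):
    """Assumed pairwise hinge loss over all pairs with distinct MOS.

    A pair contributes when the predicted order disagrees with the MOS order
    (or agrees by less than `margin`).
    """
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            if target[i] == target[j]:
                continue  # ties carry no ordering information
            sign = 1.0 if target[i] > target[j] else -1.0
            total += max(0.0, margin - sign * (pred[i] - pred[j]))
            count += 1
    return total / count if count else 0.0


def combined_loss(pred, target, rank_weight=1.0):
    """Hypothetical weighted sum of the two terms."""
    return plcc_loss(pred, target) + rank_weight * rank_loss(pred, target)


if __name__ == "__main__":
    mos = [2.0, 4.0, 6.0, 8.0]
    print(combined_loss([1.0, 2.0, 3.0, 4.0], mos))  # predictions ordered like MOS
    print(combined_loss([4.0, 3.0, 2.0, 1.0], mos))  # predictions in reverse order
```

Under these assumptions, the PLCC term rewards a linear relationship with MOS while the rank term only cares about ordering, so the two are complementary: a metric can rank videos correctly yet still be poorly calibrated, and vice versa.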

Abstract

Text-driven video editing has recently experienced rapid development. Despite this, evaluating edited videos remains a considerable challenge. Current metrics tend to fail to align with human perceptions, and effective quantitative metrics for video editing are still notably absent. To address this, we introduce VE-Bench, a benchmark suite tailored to the assessment of text-driven video editing. This suite includes VE-Bench DB, a video quality assessment (VQA) database for video editing. VE-Bench DB encompasses a diverse set of source videos featuring various motions and subjects, along with multiple distinct editing prompts, editing results from 8 different models, and the corresponding Mean Opinion Scores (MOS) from 24 human annotators. Based on VE-Bench DB, we further propose VE-Bench QA, a quantitative human-aligned measurement for the text-driven video editing task. In addition to the aesthetic, distortion, and other visual quality indicators that traditional VQA methods emphasize, VE-Bench QA focuses on the text-video alignment and the relevance modeling between source and edited videos. It proposes a new assessment network for video editing that attains superior performance in alignment with human preferences. To the best of our knowledge, VE-Bench introduces the first quality assessment dataset for video editing and an effective subjective-aligned quantitative metric for this domain. All data and code will be publicly available at https://github.com/littlespray/VE-Bench.
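The abstract reports MOS collected from 24 annotators, and the Figure 4 caption refers to both raw and Z-score MOS. The exact normalization the authors used is not stated in this summary. As a rough sketch, a per-annotator Z-score followed by averaging, a frequent choice in subjective quality studies, could look like the following. The function name `zscore_mos`, the `ratings[annotator][video]` layout, and the handling of annotators who give constant scores are assumptions for illustration.

```python
import math


def zscore_mos(ratings):
    """Return one Z-score MOS per video.

    ratings[a][v] is the raw score annotator `a` gave to video `v`
    (assumed layout; every annotator rates every video).
    """
    normalised = []
    for row in ratings:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((r - mean) ** 2 for r in row) / len(row))
        if std == 0:
            # An annotator who gives every video the same score carries no
            # preference information; map their scores to 0 (assumption).
            normalised.append([0.0] * len(row))
        else:
            normalised.append([(r - mean) / std for r in row])

    n_videos = len(ratings[0])
    n_annotators = len(normalised)
    # Average the normalised scores over annotators, video by video.
    return [
        sum(row[v] for row in normalised) / n_annotators
        for v in range(n_videos)
    ]


if __name__ == "__main__":
    # Two annotators who agree on the order but use different score ranges.
    raw = [
        [1.0, 2.0, 3.0],
        [2.0, 4.0, 6.0],
    ]
    print(zscore_mos(raw))
```

Normalizing each annotator separately removes differences in how strictly or generously individuals use the rating scale, so the averaged score reflects relative preference rather than personal offsets.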

Paper Structure (24 sections, 4 equations, 7 figures, 4 tables)

Introduction
Related Work
Metrics for Video Editing
Datasets for Video Editing Assessment
Methods for Video Editing
VE-Bench DB: Subjective-Aligned Dataset for Text-Driven Video Editing
Source Video Collection
Prompt Selection
Video Editing
Subjective Study
Dataset Analysis
VE-Bench QA: Subjective-Aligned Metric for Text-Driven Video Editing
Video-Text Alignment
Source-Target Relationship
Visual Quality
...and 9 more sections

Figures (7)

Figure 1: Overview of the proposed VE-Bench.
Figure 2: Collection of source videos. (a) Sources of videos. (b) Types of videos. (c) Motion categories. (d) Content categories.
Figure 3: Statistics of VE-Bench DB prompts. (a) Word cloud of VE-Bench DB prompts. (b) Proportion of different types
Figure 4: Statistics on MOS. (a) The distribution of the raw/Z-score MOS. (b) Z-score MOS distributions of 8 editing methods.
Figure 5: Model performance on different types of prompts.
...and 2 more figures
