EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models

Yupeng Chen, Penglin Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie

TL;DR

EditBoard addresses the lack of standardized evaluation for text-based video editing by proposing a comprehensive benchmark with four dimensions and nine metrics, including three novel fidelity metrics: FF-α, FF-β, and Semantic Score. It formalizes the editing problem as $E(f_0, f_1, \ldots, f_n; p_s, p_t) = (f_0', f_1', \ldots, f_n')$ and couples fidelity, execution, consistency, and style to a task-oriented testing regime. The paper validates EditBoard with results on multiple SOTA models, demonstrates alignment with human judgments via a transcript-based evaluation and Pearson correlations, and analyzes the fidelity-execution trade-offs across editing approaches. By open-sourcing EditBoard, the work aims to standardize evaluation, reveal model strengths and weaknesses, and spur the development of more robust text-based video editing models with practical impact on AIGC research and deployment.
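
The Pearson correlation mentioned above measures how linearly an automatic metric tracks human ratings. The following is a minimal sketch of that computation, not code from EditBoard: the function name `pearson_r` and the sample values are hypothetical placeholders chosen for illustration, and they are not results from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must be 1-D sequences of equal length >= 2")
    # Center both sequences, then divide the covariance term by the
    # product of the two standard-deviation terms.
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return float((xc * yc).sum() / denom)

# Placeholder values for illustration only; NOT results from the paper.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.67]  # hypothetical automatic metric
human_ratings = [3.1, 3.8, 2.9, 4.4, 3.5]       # hypothetical human preference
r = pearson_r(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value of `r` close to 1 would indicate that videos scored higher by the metric also tend to be rated higher by annotators; a value near 0 would indicate no linear relationship.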

Abstract

The rapid development of diffusion models has significantly advanced AI-generated content (AIGC), particularly in Text-to-Image (T2I) and Text-to-Video (T2V) generation. Text-based video editing, leveraging these generative capabilities, has emerged as a promising field, enabling precise modifications to videos based on text prompts. Despite the proliferation of innovative video editing models, there is a conspicuous lack of comprehensive evaluation benchmarks that holistically assess these models' performance across various dimensions. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score, which obscures models' effectiveness on individual editing tasks. To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four dimensions, evaluating models on four task categories and introducing three new metrics to assess fidelity. This task-oriented benchmark facilitates objective evaluation by detailing model performance and providing insights into each model's strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models.

Paper Structure

This paper contains 20 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of EditBoard. We propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. We design a task-oriented evaluation benchmark with four dimensions that break down models' performance across multiple levels, facilitating objective evaluation and offering valuable insights. Additionally, we introduce three new metrics and apply nine metrics in total that cover all the evaluation dimensions. EditBoard produces a transcript for each model to discover its advantages and limitations. We also conduct Human Preference Annotation for the edited videos, demonstrating that EditBoard evaluation results align closely with human perception.
  • Figure 2: We visualize the errors in reconstructing both the edited frames and the original frames, in each case using optical flows from the original video. For videos satisfying the requirement of FF-$\alpha$, reconstructing the original frames yields only minor errors compared with reconstructing the edited frames. We use the reconstruction error of the edited frames to calculate FF-$\alpha$.
  • Figure 3: Categorization of video editing tasks.
  • Figure 4: Visualization of the performance of FateZero, Control-A-Video, Ground-A-Video, Video-P2P, and TokenFlow on the four tasks. Most models perform worse on SOMA and MOA, which supports our categorization of tasks into different levels.
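
The Figure 2 caption describes reconstructing frames with optical flow taken from the original video and measuring the resulting error. The sketch below illustrates that general idea only; it is not the paper's FF-$\alpha$ formula, which this page does not give. The function names, the flow convention (`flow[y, x] = (dx, dy)` from frame t to frame t+1), the nearest-neighbour sampling, and the mean-absolute-error measure are all assumptions made for illustration.

```python
import numpy as np

def warp_with_flow(next_frame, flow):
    """Reconstruct frame t by sampling frame t+1 at flow-displaced positions.

    next_frame: (H, W) or (H, W, C) array holding frame t+1.
    flow: (H, W, 2) array; flow[y, x] = (dx, dy) is the assumed displacement
          of pixel (x, y) from frame t to frame t+1.
    Sampling is nearest-neighbour, with coordinates clamped to the border.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return next_frame[src_y, src_x]

def reconstruction_error(frame_t, frame_next, flow):
    """Mean absolute error between frame t and its flow-based reconstruction."""
    recon = warp_with_flow(frame_next, flow)
    return float(np.mean(np.abs(recon.astype(float) - frame_t.astype(float))))

# Toy data: the original video shifts one pixel to the right per frame.
orig_t = np.arange(64, dtype=float).reshape(8, 8)
orig_next = np.roll(orig_t, 1, axis=1)
flow_orig = np.zeros((8, 8, 2))
flow_orig[..., 0] = 1.0  # dx = +1, dy = 0 everywhere

# Hypothetical "edited" video whose content does not follow the original motion.
edit_t = orig_t.copy()
edit_next = orig_t.copy()

err_orig = reconstruction_error(orig_t, orig_next, flow_orig)
err_edit = reconstruction_error(edit_t, edit_next, flow_orig)
print(err_orig, err_edit)
```

In this toy setup the original frames are reconstructed almost exactly (the only mismatch is at the clamped border column), while the edited frames, whose motion is inconsistent with the original flow, produce a larger error. A real pipeline would estimate the flow with an optical-flow model and would likely use sub-pixel interpolation rather than nearest-neighbour sampling.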