AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark

Xinhao Xiang, Xiao Liu, Zizhong Li, Zhuosheng Liu, Jiawei Zhang

TL;DR

The paper tackles the fragmented landscape of AI-generated video evaluation by introducing AIGVE-Tool, a modular, configuration-driven toolkit with a five-category metric taxonomy. It pairs this framework with AIGVE-Bench, a large, human-annotated benchmark (500 prompts, 2,430 videos, 21,870 scores across nine aspects) to enable standardized, multi-dimensional evaluation across state-of-the-art video generators. The approach emphasizes modularity, reproducibility, and extensibility through AIGVELoop, Customizable DataLoaders, and Modular Metrics, facilitating easy integration of new datasets and metrics. Empirical results reveal current models excel on natural subjects but struggle with urban scenes and complex interactions, and demonstrate that regression-based fusion of metrics better aligns automatic scores with human judgments, advancing robust, scalable AIGVE research.
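As a rough illustration of what regression-based fusion of metrics can look like, the sketch below fits least-squares weights that map two automatic metric scores onto human ratings. It is a minimal sketch under stated assumptions, not the paper's procedure: the two-metric, no-intercept linear model, the function name `fit_two_metric_fusion`, and all scores are hypothetical.

```python
from typing import Sequence, Tuple


def fit_two_metric_fusion(
    m1: Sequence[float], m2: Sequence[float], human: Sequence[float]
) -> Tuple[float, float]:
    """Return (w1, w2) minimizing sum((w1*m1 + w2*m2 - human)**2).

    Hypothetical helper: solves the 2x2 normal equations with Cramer's rule.
    """
    s11 = sum(a * a for a in m1)
    s12 = sum(a * b for a, b in zip(m1, m2))
    s22 = sum(b * b for b in m2)
    t1 = sum(a * h for a, h in zip(m1, human))
    t2 = sum(b * h for b, h in zip(m2, human))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("metrics are collinear; weights are not identifiable")
    w1 = (t1 * s22 - s12 * t2) / det
    w2 = (s11 * t2 - s12 * t1) / det
    return w1, w2


# Made-up scores for four videos (not taken from AIGVE-Bench).
metric_a = [1.0, 2.0, 3.0, 4.0]
metric_b = [2.0, 1.0, 4.0, 3.0]
# Synthetic "human" ratings built as 0.25 * metric_a + 0.75 * metric_b.
human_scores = [1.75, 1.25, 3.75, 3.25]

w1, w2 = fit_two_metric_fusion(metric_a, metric_b, human_scores)
fused = [w1 * a + w2 * b for a, b in zip(metric_a, metric_b)]
print(w1, w2, fused)
```

A real setup would likely use more metrics, an intercept, and held-out videos to check agreement with human scores; the closed-form two-metric case is used here only to keep the example dependency-free.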

Abstract

The rapid advancement in AI-generated video synthesis has led to a growing demand for standardized and effective evaluation metrics. Existing metrics lack a unified framework for systematically categorizing methodologies, limiting a holistic understanding of the evaluation landscape. Additionally, fragmented implementations and the absence of standardized interfaces lead to redundant processing overhead. Furthermore, many prior approaches are constrained by dataset-specific dependencies, limiting their applicability across diverse video domains. To address these challenges, we introduce AIGVE-Tool (AI-Generated Video Evaluation Toolkit), a unified framework that provides a structured and extensible pipeline for comprehensive AI-generated video evaluation. Organized within a novel five-category taxonomy, AIGVE-Tool integrates multiple evaluation methodologies while allowing flexible customization through a modular configuration system. Additionally, we propose AIGVE-Bench, a large-scale benchmark dataset created with five SOTA video generation models based on hand-crafted instructions and prompts. This dataset systematically evaluates various video generation models across nine critical quality dimensions. Extensive experiments demonstrate the effectiveness of AIGVE-Tool in providing standardized and reliable evaluation results, highlighting specific strengths and limitations of current models and facilitating the advancement of next-generation AI-generated video techniques.

Paper Structure

This paper contains 40 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overall Structure of AIGVE-Tool. Built on MMEngine [mmengine2022], the framework consists of three core components: 1) Configuration Files that enable customization through Python-based settings for dataset loading, metric selection, and evaluation parameters; 2) Customizable DataLoaders that standardize diverse video formats and extract features while supporting various pre-processing operations; and 3) Modular Metrics organized within our five-category taxonomy. These components interact through the AIGVELoop system, which standardizes the workflow with process() and compute_metrics() interfaces, transforming input data into structured evaluation results. This modular design allows researchers to seamlessly integrate new datasets and metrics without modifying core components.
  • Figure 2: Score distribution across different models. The top plot illustrates the models' overall performance across various object categories, while the bottom plot presents their performance across different evaluation metrics. $\mu$ represents the mean score.
  • Figure 3: Case Study of AIGVE-Bench. The top row shows that state-of-the-art video generation models excel at high-fidelity elements but struggle with interaction quality (red squares). The bottom row highlights that the models are better at generating natural scenes than urban views.
  • Figure 4: Subject and Dynamic Distribution of AIGVE-Bench Benchmark Dataset.
  • Figure 5: Case Study of Human Evaluation for Video Generation Models. TQ: Technical Quality, DYM: Dynamics, CONS: Consistency, PHY: Physics, EP: Element Presence, EQ: Element Quality, AP: Action Presence, AQ: Action Quality, OR: Overall.
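
The process() and compute_metrics() workflow described in the Figure 1 caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the toolkit's code: `SketchMetric`, `MeanFrameBrightness`, `run_loop`, and the toy data layout are hypothetical stand-ins, and the sketch does not import MMEngine or AIGVE-Tool.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Sequence


class SketchMetric(ABC):
    """Hypothetical stand-in for a modular metric (not the toolkit's class)."""

    def __init__(self) -> None:
        # Per-video intermediate results accumulated across batches.
        self.results: List[Any] = []

    @abstractmethod
    def process(self, data_batch: Sequence[Dict[str, Any]]) -> None:
        """Consume one batch and append per-sample results."""

    @abstractmethod
    def compute_metrics(self, results: List[Any]) -> Dict[str, float]:
        """Aggregate accumulated results into named scores."""


class MeanFrameBrightness(SketchMetric):
    """Toy metric: mean pixel value per video, then averaged over videos."""

    def process(self, data_batch: Sequence[Dict[str, Any]]) -> None:
        for sample in data_batch:
            frames = sample["frames"]  # assumed layout: list of flat pixel lists
            total = sum(sum(frame) for frame in frames)
            count = sum(len(frame) for frame in frames)
            self.results.append(total / count)

    def compute_metrics(self, results: List[Any]) -> Dict[str, float]:
        return {"mean_brightness": sum(results) / len(results)}


def run_loop(
    dataloader: Iterable[Sequence[Dict[str, Any]]], metrics: List[SketchMetric]
) -> Dict[str, float]:
    """Hypothetical evaluation loop: feed every batch to every metric, then aggregate."""
    for batch in dataloader:
        for metric in metrics:
            metric.process(batch)
    scores: Dict[str, float] = {}
    for metric in metrics:
        scores.update(metric.compute_metrics(metric.results))
    return scores


# Two single-video batches with made-up pixel values.
toy_loader = [
    [{"frames": [[0.0, 1.0], [1.0, 1.0]]}],
    [{"frames": [[0.0, 0.0], [0.5, 0.5]]}],
]
scores = run_loop(toy_loader, [MeanFrameBrightness()])
print(scores)
```

Because the loop only depends on the two methods, a new metric can be added by subclassing and implementing them, which is the kind of extension without core changes that the caption describes.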