AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark
Xinhao Xiang, Xiao Liu, Zizhong Li, Zhuosheng Liu, Jiawei Zhang
TL;DR
The paper tackles the fragmented landscape of AI-generated video evaluation by introducing AIGVE-Tool, a modular, configuration-driven toolkit with a five-category metric taxonomy. It pairs this framework with AIGVE-Bench, a large, human-annotated benchmark (500 prompts, 2,430 videos, 21,870 scores across nine aspects) to enable standardized, multi-dimensional evaluation across state-of-the-art video generators. The approach emphasizes modularity, reproducibility, and extensibility through AIGVELoop, Customizable DataLoaders, and Modular Metrics, facilitating easy integration of new datasets and metrics. Empirical results reveal current models excel on natural subjects but struggle with urban scenes and complex interactions, and demonstrate that regression-based fusion of metrics better aligns automatic scores with human judgments, advancing robust, scalable AIGVE research.
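To make the idea of regression-based fusion concrete, the following is a minimal sketch, not the authors' implementation. It assumes ordinary least squares as the regressor and uses synthetic numbers in place of real metric outputs and human annotations. The function names `fit_fusion_weights` and `fuse` are hypothetical and are not part of AIGVE-Tool.

```python
import numpy as np


def fit_fusion_weights(metric_scores, human_scores):
    """Fit linear weights (plus an intercept) mapping metric scores to human scores.

    Hypothetical helper: plain ordinary least squares, assumed for illustration.
    """
    X = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    # Append a column of ones so the last coefficient acts as the intercept.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def fuse(metric_scores, coef):
    """Combine per-metric scores into one fused score using fitted weights."""
    X = np.asarray(metric_scores, dtype=float)
    return X @ coef[:-1] + coef[-1]


# Synthetic stand-in data (NOT results from the paper): 200 videos, 3 metrics.
rng = np.random.default_rng(0)
metrics = rng.uniform(0.0, 1.0, size=(200, 3))
true_w = np.array([2.0, 0.5, 1.5])
human = metrics @ true_w + 1.0 + rng.normal(0.0, 0.1, size=200)

coef = fit_fusion_weights(metrics, human)
fused = fuse(metrics, coef)

# Pearson correlation with the (synthetic) human scores, fused vs. each metric alone.
corr_fused = np.corrcoef(fused, human)[0, 1]
corr_single = [np.corrcoef(metrics[:, j], human)[0, 1] for j in range(3)]
```

On the data used for fitting, the fused score correlates with the target at least as well as any single metric does, which is the intuition behind fusing metrics by regression. A real evaluation would measure correlation on held-out videos.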
Abstract
The rapid advancement in AI-generated video synthesis has led to a growing demand for standardized and effective evaluation metrics. Existing metrics lack a unified framework for systematically categorizing methodologies, limiting a holistic understanding of the evaluation landscape. Additionally, fragmented implementations and the absence of standardized interfaces lead to redundant processing overhead. Furthermore, many prior approaches are constrained by dataset-specific dependencies, limiting their applicability across diverse video domains. To address these challenges, we introduce AIGVE-Tool (AI-Generated Video Evaluation Toolkit), a unified framework that provides a structured and extensible pipeline for comprehensive AI-generated video evaluation. Organized within a novel five-category taxonomy, AIGVE-Tool integrates multiple evaluation methodologies while allowing flexible customization through a modular configuration system. We also propose AIGVE-Bench, a large-scale benchmark dataset created with five SOTA video generation models based on hand-crafted instructions and prompts. This dataset systematically evaluates various video generation models across nine critical quality dimensions. Extensive experiments demonstrate the effectiveness of AIGVE-Tool in providing standardized and reliable evaluation results, highlighting specific strengths and limitations of current models and facilitating the advancement of next-generation AI-generated video techniques.
