AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI

Fanda Fan, Chunjie Luo, Wanling Gao, Jianfeng Zhan

TL;DR

AIGCBench is a comprehensive, scalable benchmark for Image-to-Video (I2V) generation. It addresses the lack of diverse, open-domain evaluation data and establishes a unified framework of 11 metrics across four dimensions: control-video alignment, motion effects, temporal consistency, and video quality. A generation pipeline combines a text combiner with GPT-4 to produce rich text prompts, then renders the corresponding images with Text-to-Image diffusion models, enabling fair comparisons across open- and closed-source I2V models. The framework is validated against human judgments and applied to real-world datasets (WebVid-10M, LAION-5B) as well as the generated image-text pairs, highlighting the strengths and weaknesses of current models and pointing to future improvements in fine-grained control, longer video generation, and faster inference. The dataset and evaluation code are open-sourced, with the aim of standardizing I2V benchmarking and accelerating progress in the broader AIGC landscape.

Abstract

The burgeoning field of Artificial Intelligence Generated Content (AIGC) is witnessing rapid advancements, particularly in video generation. This paper introduces AIGCBench, a pioneering comprehensive and scalable benchmark designed to evaluate a variety of video generation tasks, with a primary focus on Image-to-Video (I2V) generation. AIGCBench tackles the limitations of existing benchmarks, which suffer from a lack of diverse datasets, by including a varied and open-domain image-text dataset that evaluates different state-of-the-art algorithms under equivalent conditions. We employ a novel text combiner and GPT-4 to create rich text prompts, which are then used to generate images via advanced Text-to-Image models. To establish a unified evaluation framework for video generation tasks, our benchmark includes 11 metrics spanning four dimensions to assess algorithm performance. These dimensions are control-video alignment, motion effects, temporal consistency, and video quality. The metrics include both reference-video-dependent and reference-video-free measures, ensuring a comprehensive evaluation strategy. The proposed evaluation standard correlates well with human judgment, providing insights into the strengths and weaknesses of current I2V algorithms. The findings from our extensive experiments aim to stimulate further research and development in the I2V field. AIGCBench represents a significant step toward creating standardized benchmarks for the broader AIGC landscape, proposing an adaptable and equitable framework for future assessments of video generation tasks. We have open-sourced the dataset and evaluation code on the project website: https://www.benchcouncil.org/AIGCBench.
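
The prompt-generation step described in the abstract (text combiner, then GPT-4 refinement, then Text-to-Image rendering) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the element pools and their categories, the function names `combine_prompts` and `build_pairs`, and the `refine` and `text_to_image` callables, which merely stand in for an LLM call and a T2I model. This is not the authors' implementation.

```python
import itertools
import random

# Hypothetical element pools; the categories and entries are illustrative
# assumptions, not taken from the paper.
SUBJECTS = ["a dragon", "an astronaut"]
BEHAVIORS = ["playing chess", "riding a bicycle"]
BACKGROUNDS = ["in a desert", "under the sea"]
STYLES = ["oil painting", "pixel art"]


def combine_prompts(subjects, behaviors, backgrounds, styles, k, seed=0):
    """Sample k distinct base prompts from the Cartesian product of the pools."""
    combos = list(itertools.product(subjects, behaviors, backgrounds, styles))
    if k > len(combos):
        raise ValueError("k exceeds the number of possible combinations")
    rng = random.Random(seed)  # fixed seed keeps the sampled set reproducible
    picked = rng.sample(combos, k)
    return [f"{s} {b} {bg}, {st} style" for s, b, bg, st in picked]


def build_pairs(base_prompts, refine, text_to_image):
    """Refine each base prompt, then render an image from the refined prompt.

    `refine` stands in for an LLM call and `text_to_image` for a T2I model;
    both are caller-supplied callables, not real library APIs.
    """
    pairs = []
    for base in base_prompts:
        rich = refine(base)
        pairs.append(
            {"base_prompt": base, "prompt": rich, "image": text_to_image(rich)}
        )
    return pairs
```

Passing the model calls in as plain callables keeps the sketch runnable without any external service; in practice they would wrap whichever LLM and T2I model are available.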

Paper Structure (38 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Illustration of AIGCBench. The benchmark is divided into three modules: the evaluation dataset, the evaluation metrics, and the video generation models to be assessed. It encompasses two types of datasets: video-text and image-text datasets. To construct a more comprehensive evaluation dataset, we expand the image-text dataset using our generation pipeline. For a thorough evaluation of video generation models, we introduce 11 evaluation metrics across four dimensions. These include both reference-video-based and reference-video-free metrics, making full use of the proposed benchmark. We also use human validation to confirm that the proposed evaluation standards are reasonable.
  • Figure 2: Image-text dataset generation pipeline and results. Top: an overview of our T2I generation pipeline. Bottom: eight generated cases, with the original text produced by the text combiner displayed beneath each image.
  • Figure 3: Three I2V cases generated by five state-of-the-art algorithms. VideoCrafter, I2VGen-XL, and SVD are open-source research models, while Pika and Gen2 are closed-source projects. For additional videos, please refer to our project website.
  • Figure 4: We tallied votes from 42 participants, who evaluated five state-of-the-art I2V algorithms on four aspects. The numerical values in the radar chart represent the proportion of users who voted for each algorithm as the best performer in that aspect.
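
The radar-chart values described for Figure 4 are per-aspect vote shares. The sketch below shows one way such shares could be computed from raw ballots. The function name `vote_shares`, the input layout, and the toy ballots are all assumptions for illustration; the ballots are invented and are not the paper's data.

```python
from collections import Counter


def vote_shares(votes, candidates):
    """Return, per aspect, the fraction of voters who picked each candidate as best.

    `votes` maps an aspect name to a list of ballots, one candidate name per voter.
    """
    shares = {}
    for aspect, ballots in votes.items():
        counts = Counter(ballots)
        total = len(ballots)
        shares[aspect] = {
            c: (counts[c] / total if total else 0.0) for c in candidates
        }
    return shares


# Toy ballots from four hypothetical voters (NOT results from the paper).
ALGORITHMS = ["VideoCrafter", "I2VGen-XL", "SVD", "Pika", "Gen2"]
toy_votes = {"video quality": ["Pika", "Gen2", "Pika", "SVD"]}
toy_shares = vote_shares(toy_votes, ALGORITHMS)
```

Within each aspect the shares sum to 1 as long as every ballot names a listed candidate, which is what lets the values be read directly as proportions on the chart.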