AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM

Jiarui Wang, Huiyu Duan, Guangtao Zhai, Juntong Wang, Xiongkuo Min

TL;DR

This paper introduces a novel VQA model that leverages spatiotemporal features and LMM frameworks to capture the intricate quality attributes of AIGVs. The model predicts video quality scores and video pair preferences and demonstrates state-of-the-art performance.

Abstract

The rapid advancement of large multimodal models (LMMs) has led to a surge in artificial intelligence generated videos (AIGVs), which highlights the pressing need for effective video quality assessment (VQA) models designed specifically for AIGVs. Current VQA models generally fall short in assessing the perceptual quality of AIGVs because these videos exhibit unique distortions, such as unrealistic objects, unnatural movements, or inconsistent visual elements. To address this challenge, we first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, we devise a systematic annotation pipeline including scoring and ranking processes, which has collected 370k expert ratings to date. Based on AIGVQA-DB, we further introduce AIGV-Assessor, a novel VQA model that leverages spatiotemporal features and LMM frameworks to capture the intricate quality attributes of AIGVs, thereby accurately predicting video quality scores and video pair preferences. Through comprehensive experiments on both AIGVQA-DB and existing AIGV databases, AIGV-Assessor demonstrates state-of-the-art performance, significantly surpassing existing scoring or evaluation methods across multiple perceptual quality dimensions.

Paper Structure

This paper contains 49 sections, 10 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: An overview of the AIGVQA-DB construction pipeline, illustrating the generation and the subjective evaluation procedures for the AIGVs in the database. (a) Prompt categorization according to the spatial major content. (b) Prompt categorization according to the temporal descriptions. (c) Prompt categorization according to the attribute control. (d) Prompt categorization according to the prompt complexity. (e) The 15 generative models used in the database. (f) Four visual quality evaluation perspectives, including static quality, temporal smoothness, dynamic degree, and text-video correspondence. (g) and (h) demonstrate the pair comparison and preference scoring processes, respectively.
  • Figure 2: Video score distribution from the four perspectives, including static quality, temporal smoothness, dynamic degree, and text-video correspondence. (a) Distribution of raw scores. (b) Distribution of Mean Opinion Scores (MOSs).
  • Figure 3: Comparison of averaged win rates of different generation models across different categories. (a) Results across prompt complexity. (b) Results across attribute control. (c) Results across temporal major contents. (d) Results across spatial major contents.
  • Figure 4: (a) Comparison of text-to-video generation models regarding the MOS in terms of four dimensions sorted bottom-up by their averaged MOS. (b) Comparison of text-to-video generation models regarding the win rate in terms of four dimensions sorted bottom-up by their averaged win rate.
  • Figure 5: The framework of AIGV-Assessor: (a) AIGV-Assessor takes AI-generated video frames as input and outputs both text-based quality levels and numerical quality scores. The system begins by extracting spatiotemporal features with two vision encoders; these features are then passed through spatial and temporal projection modules to generate visual tokens aligned with the language space. The LLM decoder produces text-based feedback describing the video quality level for each of the four evaluation dimensions. Simultaneously, the last hidden states from the LLM are used to perform quality regression that outputs the final quality scores for the four dimensions. (b) AIGV-Assessor is fine-tuned on pairwise comparison, further allowing the model to output a comparative evaluation of two videos.
  • ...and 12 more figures
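
The figure captions refer to two aggregate quantities, Mean Opinion Scores (MOSs) and averaged win rates from pair comparisons. A minimal Python sketch of how such quantities can be computed is given below. It is an illustration under stated assumptions, not the paper's procedure: the MOS here is a plain arithmetic mean with no rater normalization, the win rate is wins divided by the number of comparisons a model appears in, and the function names and input layouts (`mean_opinion_scores`, `win_rates`, tuples of ids and scores) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_opinion_scores(ratings):
    """Average raw scores per video.

    ratings: iterable of (video_id, score) pairs (hypothetical layout).
    Assumption: a plain mean; the paper may normalize ratings first.
    """
    by_video = defaultdict(list)
    for video_id, score in ratings:
        by_video[video_id].append(score)
    return {vid: mean(scores) for vid, scores in by_video.items()}


def win_rates(comparisons):
    """Fraction of pair comparisons each model wins.

    comparisons: iterable of (winner_model, loser_model) pairs
    (hypothetical layout; ties are not modeled).
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {model: wins[model] / total[model] for model in total}


if __name__ == "__main__":
    print(mean_opinion_scores([("v1", 4), ("v1", 2), ("v2", 5)]))
    print(win_rates([("A", "B"), ("A", "C"), ("B", "C")]))
```

Computing these values separately for each of the four dimensions (static quality, temporal smoothness, dynamic degree, text-video correspondence) and then averaging would give per-model summaries of the kind compared in Figures 3 and 4.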