Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap

Bowen Qu, Xiaoyu Liang, Shangkun Sun, Wei Gao

TL;DR

This work tackles AIGC video quality assessment by introducing a tri-dimensional framework: visual harmony, video-text consistency, and domain distribution gap. It integrates a DOVER-based visual backbone with learnable pooling, explicit prompt injection, implicit text guidance, and Video-LLaVA caption similarity, augmented by an auxiliary classifier that predicts the source generative model to handle the distribution gap between models. The method was used by the third-place winner of the NTIRE 2024 Quality Assessment for AI-Generated Content, Track 2 (Video), with ablations validating the contribution of each component and comparisons showing gains over several baselines. The findings highlight the value of multi-modal cues and model-origin awareness for reliable AIGC video quality assessment, and the authors state that code will be released for future research.

Abstract

The recent advancements in Text-to-Video Artificial Intelligence Generated Content (AIGC) have been remarkable. Compared with traditional videos, the assessment of AIGC videos encounters various challenges: visual inconsistencies that defy common sense, discrepancies between the content and the textual prompt, and distribution gaps between generative models. To address these challenges, we categorize the assessment of AIGC video quality into three dimensions: visual harmony, video-text consistency, and domain distribution gap. For each dimension, we design specific modules to provide a comprehensive quality assessment of AIGC videos. Furthermore, our research identifies significant variations in visual quality, fluidity, and style among videos generated by different text-to-video models. Predicting the source generative model can make the AIGC video features more discriminative, which enhances the quality assessment performance. The proposed method was used by the third-place winner of the NTIRE 2024 Quality Assessment for AI-Generated Content - Track 2 Video, demonstrating its effectiveness. Code will be available at https://github.com/Coobiw/TriVQA.
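
The idea of predicting the source generative model alongside the quality score can be illustrated with a small joint objective. The sketch below is not the authors' implementation: the squared-error quality term, the `aux_weight` value, and all function names are assumptions chosen only to show how an auxiliary classification loss might be added to a regression loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])


def joint_loss(pred_score, true_score, source_logits, source_index, aux_weight=0.1):
    """Quality regression loss plus a weighted auxiliary source-model loss.

    pred_score / true_score: predicted and annotated quality scores.
    source_logits: one logit per candidate text-to-video model (hypothetical).
    source_index: index of the model that actually generated the video.
    aux_weight: hypothetical trade-off weight, not taken from the paper.
    """
    quality_term = (pred_score - true_score) ** 2
    source_term = cross_entropy(source_logits, source_index)
    return quality_term + aux_weight * source_term
```

With a perfect quality prediction and uninformative logits over two candidate models, the loss reduces to the weighted classification term alone, so the shared features are still pushed to separate videos by their source model.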

Paper Structure (18 sections, 11 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Three Dimensions for AIGC Video Quality Assessment.
  • Figure 2: Detailed overview of our framework. (a) illustrates the whole framework, which serves as an incremental enhancement of DOVER. In addition to the visual part, our framework incorporates modules that handle the explicit prompt and the implicit text, strengthening its ability to assess video-text consistency. (b) shows the workflow of the Text2Video Cross Attention Pooling module, which is based on a cross-attention mechanism.
  • Figure 3: An example used for in-context learning with Video-LLaVA to generate a prompt-like caption.
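
The caption of Figure 2(b) says only that the Text2Video Cross Attention Pooling module is based on cross-attention. As a rough illustration of that general mechanism, the sketch below pools per-frame features into a single vector using attention weights computed against a text embedding. Treating the text embedding as the query and the frame features as keys and values is an assumption, as are the function name and the omission of learned projection matrices; this is not the module from the paper.

```python
import math


def cross_attention_pool(text_query, frame_features):
    """Pool frame features into one vector, weighted by similarity to the text.

    text_query: list of floats, a text embedding used as the attention query
        (assumed direction of attention).
    frame_features: list of per-frame feature vectors with the same length as
        text_query; they act as both keys and values here (no learned
        projections, a simplification).
    Returns (pooled_vector, attention_weights).
    """
    dim = len(text_query)
    scale = math.sqrt(dim)
    # Scaled dot-product score between the text query and each frame.
    scores = [
        sum(q * k for q, k in zip(text_query, frame)) / scale
        for frame in frame_features
    ]
    # Softmax over frames, stabilised by subtracting the maximum score.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frame features.
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(dim)
    ]
    return pooled, weights
```

Frames whose features align more closely with the text embedding receive larger weights, so the pooled vector emphasises the parts of the video most relevant to the prompt, in contrast to plain average pooling, which weights every frame equally.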