Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency

Shangkun Sun, Xiaoyu Liang, Bowen Qu, Wei Gao

TL;DR

This work tackles the challenge of evaluating content-rich AIGC videos generated by next-generation models like Sora, where long, detailed prompts and complex motions degrade the reliability of traditional VQA metrics. It introduces CRAVE, a three-branch evaluator combining visual harmony (DOVER-based), text-video semantic alignment (Multi-Granularity Text-Temporal fusion), and motion-aware fidelity (Hybrid Motion-fidelity Modeling), and CRAVE-DB, a large-scale benchmark with 1,228 videos and 410 elaborate prompts assessed by 29 annotators. Extensive experiments on CRAVE-DB and T2VQA-DB show that CRAVE achieves leading human-aligned performance, including robust zero-shot generalization to newer models. The approach advances precise, human-correlated evaluation for evolving AIGC video generation and provides public data and code to foster further research.

Abstract

The advent of next-generation video generation models like Sora poses challenges for AI-generated content (AIGC) video quality assessment (VQA). These models substantially mitigate the flickering artifacts prevalent in prior models, accept longer and more complex text prompts, and generate longer videos with intricate, diverse motion patterns. Conventional VQA methods, designed for simple text and basic motion patterns, struggle to evaluate these content-rich videos. To this end, we propose CRAVE (Content-Rich AIGC Video Evaluator), designed specifically for the evaluation of Sora-era AIGC videos. CRAVE introduces multi-granularity text-temporal fusion, which aligns long-form, complex textual semantics with video dynamics. Additionally, CRAVE leverages hybrid motion-fidelity modeling to assess temporal artifacts. Furthermore, because the prompts and content in current AIGC VQA datasets are straightforward, we introduce CRAVE-DB, a benchmark featuring content-rich videos from next-generation models paired with elaborate prompts. Extensive experiments show that the proposed CRAVE achieves excellent results on multiple AIGC VQA benchmarks, demonstrating a high degree of alignment with human perception. All data and code will be publicly available at https://github.com/littlespray/CRAVE.
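Alignment with human perception in VQA work is commonly quantified by correlating a model's predicted scores with mean opinion scores (MOS). The sketch below illustrates that idea with Spearman rank correlation and Pearson linear correlation. It is a minimal, standard-library example, not the authors' evaluation code: the choice of these two coefficients is an assumption about the protocol, and the `mos` and `pred` values are hypothetical numbers invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human MOS and model predictions for five videos.
mos = [3.1, 4.2, 2.0, 4.8, 3.6]
pred = [0.52, 0.71, 0.30, 0.80, 0.75]

print(f"Spearman: {spearman(mos, pred):.3f}")
print(f"Pearson:  {pearson(mos, pred):.3f}")
```

Rank correlation only asks whether the model orders videos the way annotators do, so it is unaffected by the predictor's output scale. Linear correlation additionally rewards predictions that are proportional to MOS.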

Paper Structure

This paper contains 23 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison of current and previous AIGC videos. Videos are generated by LaVie [wang2023lavie] (1st row) and Sora [sora] (2nd row), respectively. Nouns that should be present in the video are highlighted in orange, while adjectives that add detail are highlighted in blue. The new-generation AIGC videos contain richer content.
  • Figure 2: Word cloud of prompts in CRAVE-DB.
  • Figure 3: The collection of the proposed CRAVE-DB.
  • Figure 4: Distribution of MOS in CRAVE-DB.
  • Figure 5: Network overview of the proposed CRAVE.
  • ...and 3 more figures