Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment
Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang
TL;DR
This work addresses the challenge of assessing AI-generated video quality by introducing MSA-VQA, a multilevel semantic-aware framework that analyzes video content at frame, segment, and full-video scales. It combines CLIP-based Prompt Semantic Supervision to ensure semantic alignment with prompts and a Semantic Mutation-aware module to detect subtle frame-to-frame semantic changes, both integrated within a three-branch ensemble trained with specialized losses. The approach achieves state-of-the-art results on AI-generated VQA benchmarks, demonstrating the benefit of aligning perceptual quality with semantic coherence and mutation dynamics in generated videos. The findings suggest that incorporating semantic supervision and cross-attentive mutation modeling can significantly improve the reliability of VQA for AI-generated content, with practical implications for quality control in AIGC pipelines.
Abstract
The rapid development of diffusion models has recently greatly advanced AI-generated videos in both length and consistency, yet assessing their quality remains challenging. Previous approaches have mostly focused on User-Generated Content (UGC), and few methods have targeted AI-generated video quality assessment. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module that uses the text encoder of CLIP to ensure semantic consistency between videos and their conditional prompts. Additionally, we propose a Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate that our method achieves state-of-the-art results.
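The two semantic signals named in the abstract, prompt consistency and frame-to-frame variation, can be illustrated with a minimal sketch. This is not the MSA-VQA implementation: the function names (`prompt_alignment`, `mutation_scores`) are hypothetical, the toy 2-D vectors stand in for embeddings that would in practice come from CLIP's image and text encoders, and plain cosine similarity is swapped in for the learned supervision and cross-attention described in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prompt_alignment(frame_embs, prompt_emb):
    """Mean cosine similarity between each frame embedding and the prompt embedding."""
    sims = [cosine(f, prompt_emb) for f in frame_embs]
    return sum(sims) / len(sims)

def mutation_scores(frame_embs):
    """Cosine distance (1 - similarity) between each pair of adjacent frames."""
    return [1.0 - cosine(a, b) for a, b in zip(frame_embs, frame_embs[1:])]

# Toy 2-D embeddings standing in for CLIP features (assumption, not real CLIP output).
prompt = [1.0, 0.0]
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

alignment = prompt_alignment(frames, prompt)  # how well the frames match the prompt on average
mutations = mutation_scores(frames)           # one value per adjacent frame pair
```

In this toy input the first two frames match the prompt and the third is orthogonal to it, so the second adjacent-frame distance is large while the first is zero. A learned module could, under these assumptions, treat such a jump as a cue for an abrupt semantic change.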
