Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment

Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang

TL;DR

This work addresses the challenge of assessing AI-generated video quality by introducing MSA-VQA, a multilevel semantic-aware framework that analyzes video content at frame, segment, and full-video scales. It combines CLIP-based Prompt Semantic Supervision to ensure semantic alignment with prompts and a Semantic Mutation-aware module to detect subtle frame-to-frame semantic changes, both integrated within a three-branch ensemble trained with specialized losses. The approach achieves state-of-the-art results on AI-generated VQA benchmarks, demonstrating the benefit of aligning perceptual quality with semantic coherence and mutation dynamics in generated videos. The findings suggest that incorporating semantic supervision and cross-attentive mutation modeling can significantly improve the reliability of VQA for AI-generated content, with practical implications for quality control in AIGC pipelines.

Abstract

The rapid development of diffusion models has recently advanced AI-generated videos considerably in length and consistency, yet assessing their quality remains challenging. Previous approaches have mostly focused on User-Generated Content (UGC), and few have targeted AI-generated video quality assessment. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module that uses the CLIP text encoder to ensure semantic consistency between videos and their conditioning prompts. Additionally, we propose a Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate that our method achieves state-of-the-art results.
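
To make the idea of prompt-to-video semantic consistency concrete, the following is a minimal sketch, not the paper's Prompt Semantic Supervision Module. It assumes that frame embeddings and a prompt embedding have already been produced by some CLIP-style encoder (the toy 3-dimensional vectors below are stand-ins), and it scores consistency as the mean cosine similarity between each frame and the prompt. The function names `cosine_similarity` and `prompt_consistency` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prompt_consistency(frame_embeddings, prompt_embedding):
    """Mean cosine similarity between every frame embedding and the prompt embedding.

    Assumption: both come from encoders that share one embedding space,
    as CLIP's image and text encoders do.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    sims = [cosine_similarity(f, prompt_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)


# Toy vectors standing in for real CLIP embeddings (illustrative only).
prompt = [1.0, 0.0, 0.0]
aligned = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]    # same direction as the prompt
off_topic = [[0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]  # orthogonal to the prompt

aligned_score = prompt_consistency(aligned, prompt)      # 1.0
off_topic_score = prompt_consistency(off_topic, prompt)  # 0.0
```

In the paper this signal is used as supervision during training rather than as a standalone score; the sketch only illustrates the kind of quantity such a module can compare against.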

Paper Structure

This paper contains 15 sections, 11 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustration of the MSA-VQA framework. (a) The framework has three main components that capture features at the video, segment, and frame levels. These components are trained separately for stability and ensembled at inference. (b) A Prompt Semantic Supervision (PSS) module, based on the CLIP text encoder, ensures semantic alignment between the AI-generated video and the prompt. (c) The Semantic Mutation-aware (SMA) Module models the semantic mutations between video frames.
  • Figure 2: Four frames from a video generated with the prompt: "Time lapse of a field on which a tractor passes with a machine used to collect the cut grass and then make bales of hay, with the passage of white clouds on the blue sky." The generated tractor (highlighted in yellow) shows significant instability and semantic mutations.
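
The semantic mutations described in the captions above can be illustrated with a simple heuristic. This is a sketch under stated assumptions, not the paper's SMA Module, which models mutations with cross-attention. Here a mutation is approximated as a large drop in cosine similarity between the embeddings of adjacent frames. The per-frame embeddings are assumed to come from some image encoder, and the names `mutation_scores` and `flag_mutations` and the threshold of 0.5 are hypothetical choices for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mutation_scores(frame_embeddings):
    """Return 1 - cosine similarity for each pair of adjacent frames.

    A score near 0 means the two frames are semantically close;
    a score near 1 means the content changed abruptly.
    """
    pairs = zip(frame_embeddings, frame_embeddings[1:])
    return [1.0 - cosine_similarity(prev, curr) for prev, curr in pairs]


def flag_mutations(frame_embeddings, threshold=0.5):
    """Indices of frames whose score against the previous frame exceeds the threshold."""
    scores = mutation_scores(frame_embeddings)
    return [i + 1 for i, score in enumerate(scores) if score > threshold]


# Toy 2-d embeddings: frames 0-1 agree, frame 2 jumps to a different direction.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]

flagged = flag_mutations(frames)  # [2]
```

A fixed threshold like this is a crude stand-in: it cannot tell a legitimate scene change from an unstable object such as the tractor in Figure 2, which is one reason a learned module is attractive.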