CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video

Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull

TL;DR

This work tackles no-reference video quality assessment (NR-VQA) for user-generated content (UGC) by using vision-language models to generate quality-aware captions and by fusing semantic, temporal, and spatial video features. A quality-aware prompting mechanism guides BLIP-2 to produce fine-grained captions, and a Frame Difference Fragmentation (FDF) module highlights distortion-prone regions for targeted captioning. The outputs of the SEE (semantic), SlowFast-based TME (temporal), and Swin-based SVE (spatial) modules form a unified multimodal feature set, which an MLP trained with a composite loss regresses to perceptual quality scores. Empirically, CAMP-VQA achieves state-of-the-art performance across six UGC NR-VQA datasets and generalizes well across datasets without manual fine-grained artifact annotations, offering a scalable approach for real-world video delivery optimization.
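The idea behind frame-difference fragmentation can be illustrated with a minimal sketch. It assumes, purely for illustration, that fragments are non-overlapping square patches ranked by mean absolute difference between two consecutive grayscale frames; the function name `top_difference_patches`, the patch grid, and the ranking rule are hypothetical and are not taken from the paper's FDF module.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def top_difference_patches(
    prev: Frame, curr: Frame, patch: int, k: int
) -> List[Tuple[int, int]]:
    """Return (row, col) top-left corners of the k patches with the largest
    mean absolute inter-frame difference (hypothetical selection rule)."""
    height, width = len(curr), len(curr[0])
    scored = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            total = 0.0
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    total += abs(curr[r][c] - prev[r][c])
            scored.append((total / (patch * patch), top, left))
    # Highest difference first; ties fall back to raster order.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return [(top, left) for _, top, left in scored[:k]]


if __name__ == "__main__":
    prev = [[0.0] * 4 for _ in range(4)]
    curr = [[0.0] * 4 for _ in range(2)] + [[0.0, 0.0, 5.0, 5.0] for _ in range(2)]
    print(top_difference_patches(prev, curr, patch=2, k=1))
```

In this toy input only the bottom-right 2x2 block changes between frames, so it is the patch that gets selected for closer inspection.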

Abstract

The prevalence of user-generated content (UGC) on platforms such as YouTube and TikTok has made no-reference (NR) perceptual video quality assessment (VQA) vital for optimizing video delivery. However, non-professional acquisition and the subsequent transcoding of UGC video on sharing platforms present significant challenges for NR-VQA. Although NR-VQA models attempt to infer mean opinion scores (MOS), their modeling of subjective scores for compressed content remains limited because fine-grained perceptual annotations of artifact types are unavailable. To address these challenges, we propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large vision-language models. Our approach introduces a quality-aware prompting mechanism that integrates video metadata (e.g., resolution, frame rate, bitrate) with key fragments extracted from inter-frame variations to guide a BLIP-2 pretrained model in generating fine-grained quality captions. A unified architecture models perceptual quality across three dimensions: semantic alignment, temporal characteristics, and spatial characteristics. These multimodal features are extracted, fused, and then regressed to video quality scores. Extensive experiments on a wide variety of UGC datasets demonstrate that our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations. Our method achieves the best performance in terms of average rank-order and linear correlation (SRCC: 0.928, PLCC: 0.938) compared to state-of-the-art methods. The source code and trained models, along with a user-friendly demo, are available at: https://github.com/xinyiW915/CAMP-VQA.
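For readers unfamiliar with the two reported metrics, the following sketch shows how SRCC (Spearman rank-order correlation) and PLCC (Pearson linear correlation) between predicted scores and MOS can be computed. It is a plain-Python illustration, not the paper's evaluation code; the helper names are hypothetical, and the nonlinear mapping that VQA studies often fit before computing PLCC is omitted.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-order correlation (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0, 4.0]   # made-up model outputs
    mos = [1.0, 4.0, 9.0, 16.0]        # made-up subjective scores
    print(spearman(predicted, mos), pearson(predicted, mos))
```

In the toy data the predictions order the videos perfectly but relate to MOS nonlinearly, so SRCC equals 1 while PLCC is slightly lower. This is why the two metrics are usually reported together.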

Paper Structure

This paper contains 22 sections, 17 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: CAMP-VQA automatically generates quality-aware captions and fuses spatial, temporal, and semantic features to predict video quality.
  • Figure 2: Example of quality-aware captioning from a sampled video frame and fragments.
  • Figure 3: Frame difference fragmentation (FDF) module.
  • Figure 4: Quality-related prompt hints derived from video metadata.
  • Figure 5: Different quality-aware prompt settings: quality prompt, fragment prompt, and residual prompt. We also include a content prompt for the ablation study.
  • ...and 1 more figure
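As a rough illustration of how metadata-derived prompt hints (Figure 4) might be assembled, the sketch below formats resolution, frame rate, and bitrate into a text prompt for a captioning model. The function `build_quality_prompt` and the prompt wording are hypothetical; the paper's actual prompt templates are not reproduced here.

```python
def build_quality_prompt(
    width: int, height: int, fps: float, bitrate_kbps: float
) -> str:
    """Compose a metadata-aware prompt string (illustrative wording only)."""
    return (
        f"This video has a resolution of {width}x{height}, "
        f"a frame rate of {fps:g} fps, and a bitrate of {bitrate_kbps:g} kbps. "
        "Describe any visible quality degradations."
    )


if __name__ == "__main__":
    print(build_quality_prompt(1920, 1080, 30.0, 2500.0))
```

The resulting string would be passed, together with a sampled frame or fragment, to the vision-language model. How the hints are actually phrased and combined with fragment and residual prompts is specified in the paper itself.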