CLIPVQA: Video Quality Assessment via CLIP

Fengchuang Xing, Mingjie Li, Yuan-Gen Wang, Guopu Zhu, Xiaochun Cao

TL;DR

CLIPVQA tackles blind video quality assessment (VQA) for in-the-wild videos by combining CLIP-derived language supervision with a video Transformer architecture. It introduces a frame perception Transformer (FPT) whose frame encoder is similar to the pre-trained CLIP image encoder, a spatiotemporal quality aggregation Transformer (SAT), a MOS2Language encoder for quality-language supervision, and a video content and language aggregation Transformer (VAT) that fuses visual and textual cues. A vectorized regression (VR) loss aligns predicted mean opinion score (MOS) distributions with human judgments, and a final SVR maps the predicted probabilities to real-valued quality scores. Across eight diverse datasets, CLIPVQA achieves state-of-the-art VQA performance and up to 37% better cross-dataset generalization than existing benchmark methods. Ablations show that fusion tokens, SAT, language supervision, and the VR loss each contribute to effectiveness and efficiency.
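
To make the idea of a predicted MOS distribution concrete, here is a minimal sketch. It is not the paper's formulation: the Gaussian soft-label encoding, the cosine-based loss, the five quality levels, and the expectation used in place of the SVR are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math

def mos_to_distribution(mos, levels, sigma=0.5):
    """Encode a scalar MOS as a probability vector over quality levels.

    Assumption: a Gaussian bump centred on the MOS, normalised to sum to 1.
    """
    weights = [math.exp(-((lv - mos) ** 2) / (2 * sigma ** 2)) for lv in levels]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Turn raw model outputs into a probability vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def vector_loss(pred, target):
    """1 - cosine similarity: a stand-in for a loss that compares whole vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm

def expected_score(probs, levels):
    """Map probabilities to a scalar score (simple expectation, not an SVR)."""
    return sum(p * lv for p, lv in zip(probs, levels))

levels = [1.0, 2.0, 3.0, 4.0, 5.0]           # hypothetical quality levels
target = mos_to_distribution(3.6, levels)    # human label as a distribution
pred = softmax([0.1, 0.4, 1.5, 2.0, 0.3])    # hypothetical model logits
loss = vector_loss(pred, target)
score = expected_score(pred, levels)
print(f"loss={loss:.4f} score={score:.2f}")
```

The loss is zero when the predicted and target vectors point in the same direction, which is what "aligning distributions" amounts to in this simplified setting.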

Abstract

In learning vision-language representations from web-scale data, the contrastive language-image pre-training (CLIP) mechanism has demonstrated remarkable performance in many vision tasks. However, its application to the widely studied video quality assessment (VQA) task is still an open issue. In this paper, we propose an efficient and effective CLIP-based Transformer method for the VQA problem (CLIPVQA). Specifically, we first design an effective video frame perception paradigm with the goal of extracting the rich spatiotemporal quality and content information among video frames. Then, the spatiotemporal quality features are integrated using a self-attention mechanism to yield a video-level quality representation. To utilize the quality language descriptions of videos for supervision, we develop a CLIP-based encoder for language embedding, which is then aggregated with the generated content information via a cross-attention module to produce a video-language representation. Finally, the video-level quality and video-language representations are fused for final video quality prediction, where a vectorized regression loss is employed for efficient end-to-end optimization. Comprehensive experiments are conducted on eight in-the-wild video datasets with diverse resolutions to evaluate the performance of CLIPVQA. The experimental results show that the proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods. A series of ablation studies is also performed to validate the effectiveness of each module in CLIPVQA.
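
The data flow described in the abstract (self-attention over quality features, cross-attention between the language embedding and content features, then fusion) can be sketched in a few lines. This is a toy illustration under several assumptions: attention is used without learned projections or multiple heads, the language embedding acts as the query in cross-attention, fusion is plain concatenation, and all token values and names are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (a simplification)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

# Hypothetical per-frame tokens (3 frames, 4 dimensions each).
quality_tokens = [[0.9, 0.1, 0.0, 0.2], [0.8, 0.2, 0.1, 0.1], [0.1, 0.9, 0.3, 0.0]]
content_tokens = [[0.2, 0.1, 0.7, 0.0], [0.0, 0.3, 0.6, 0.1], [0.5, 0.5, 0.0, 0.0]]
language_embedding = [[0.1, 0.0, 0.9, 0.2]]  # one embedded quality description

# Self-attention across frames, pooled into a video-level quality vector.
video_quality = mean_pool(attention(quality_tokens, quality_tokens, quality_tokens))
# Cross-attention: the language embedding queries the content tokens (assumed direction).
video_language = attention(language_embedding, content_tokens, content_tokens)[0]
# Fusion by concatenation (assumed); a prediction head would consume this vector.
fused = video_quality + video_language
print(len(fused))
```

In a real model each attention call would use learned query, key, and value projections, and the fused vector would feed the quality prediction head.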

Paper Structure (28 sections, 21 equations, 5 figures, 16 tables)

This paper contains 28 sections, 21 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: An illustration of the advantage of using natural language as supervision. These video frames are extracted from two natural VQA datasets, along with their corresponding quality text descriptions generated by the vision-language model BuboGPT (Zhao et al., 2023). With the help of these text descriptions, which correlate strongly with actual subjective assessment, a VQA model is more likely to judge correctly that (a) and (c) are high-quality frames while (b) and (d) are low-quality frames.
  • Figure 2: An overview of the proposed CLIPVQA framework. It includes a frame perception Transformer (FPT), a spatiotemporal quality aggregation Transformer (SAT), a MOS2Language encoder, a video content and language aggregation Transformer (VAT), a fusion operation, and a vectorized regression (VR) loss for optimization.
  • Figure 3: An illustration of the frame perception Transformer (FPT), which consists of $L$ CAT blocks. The frame encoder is similar to the pre-trained image encoder in CLIP.
  • Figure 4: An illustration of the video content and language aggregation Transformer (VAT), which contains $B$ CandLA blocks.
  • Figure 5: A visualization of the distributions predicted by CLIPVQA on a set of randomly selected samples from the KoNViD-1k dataset.