Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation

Yachun Mi, Yu Li, Yanting Li, Chen Hui, Tong Zhang, Zhixuan Li, Chenyue Song, Wei Yang Bryan Lim, Shaohui Liu

TL;DR

Q-CLIP is a fully Vision-Language Model (VLM)-based framework for Video Quality Assessment (VQA). It avoids the data and compute cost of pretraining on large classification datasets by integrating a lightweight Shared Cross-Modal Adapter (SCMA) and five learnable quality-level prompts. The approach relies on cross-modal alignment and frame-difference-based sampling to capture subtle quality cues across semantics, distortions, motion, and aesthetics, while keeping trainable parameters at about 0.14M. Across six datasets, with pretraining on LSVQ and fine-tuning on smaller corpora, Q-CLIP achieves state-of-the-art or competitive results at significantly lower training cost than prior VLM-augmented VQA methods, and ablation studies confirm the contribution of the SCMA and the prompts. The work demonstrates the practical viability and generalization benefits of fully VLM-based VQA, and its findings on frame sampling and cross-modal adaptation can guide future quality perception research.
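To make the idea of five quality-level prompts concrete, here is a minimal sketch of one common way such prompts can be turned into a single score: compare a video embedding with five prompt embeddings, apply a softmax, and take the expected level. The summary above does not state Q-CLIP's exact scoring rule, so the level weights 1 to 5, the temperature, and the function names `cosine` and `quality_score` are all assumptions made for illustration.

```python
import math

# Assumed level weights: 1 (worst) .. 5 (best). Not taken from the paper.
LEVEL_WEIGHTS = [1.0, 2.0, 3.0, 4.0, 5.0]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quality_score(video_feat, prompt_feats, temperature=0.01):
    """Expected quality level under a softmax over prompt similarities.

    video_feat   -- embedding of the video (hypothetical encoder output)
    prompt_feats -- five embeddings, one per quality-level prompt
    temperature  -- assumed softmax temperature
    """
    logits = [cosine(video_feat, p) / temperature for p in prompt_feats]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(w * p for w, p in zip(LEVEL_WEIGHTS, probs))


if __name__ == "__main__":
    # Toy embeddings: five orthogonal prompt vectors.
    prompts = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
    print(quality_score(prompts[4], prompts))
```

In this sketch a video whose embedding matches the highest-level prompt scores close to 5, while one that is equally similar to all five prompts scores 3.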

Abstract

Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLM-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLM in perceiving subtle quality variations, thereby further enhancing the model's sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization performance across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
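As an illustration of frame-difference-based sampling, the sketch below scores each frame by its mean absolute pixel difference from the previous frame and keeps the k highest-scoring frames in temporal order. The abstract names the strategy but does not give the selection rule, so the top-k criterion, the flat-list frame representation, and the function name `frame_difference_sample` are assumptions rather than the authors' procedure.

```python
def frame_difference_sample(frames, k):
    """Return sorted indices of the k frames that differ most from their predecessor.

    frames -- list of frames, each a flat list of pixel intensities (same length)
    k      -- number of frames to keep
    """
    if k <= 0 or not frames:
        return []

    # The first frame has no predecessor, so its difference score is 0.
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(c - p) for p, c in zip(prev, cur)) / len(cur)
        scores.append(diff)

    # Rank by score, highest first; the stable sort keeps earlier frames first on ties.
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)

    # Restore temporal order for the selected frames.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy clip: content changes at frame 2 and again at frame 4.
    clip = [[0, 0], [0, 0], [10, 10], [10, 10], [0, 0]]
    print(frame_difference_sample(clip, 2))
```

For the toy clip the two frames where the content changes are selected. A real pipeline would operate on decoded video tensors and might combine this with uniform sampling, which is outside what the abstract specifies.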

Paper Structure

This paper contains 23 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison of Q-CLIP with leading VQA methods on LSVQ. Q-CLIP achieves the best performance while training only a minimal number of parameters.
  • Figure 2: The overall framework of the proposed Q-CLIP.
  • Figure 3: Architecture of the proposed SCMA.
  • Figure 4: Frame Sampling Diagram.
  • Figure 5: Ablation on the number of E-SCMA Layers.
  • ...and 1 more figure