PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild

Kun Yuan, Hongbo Liu, Mading Li, Muyi Sun, Ming Sun, Jiachao Gong, Jinhua Hao, Chao Zhou, Yansong Tang

TL;DR

This work tackles no-reference video quality assessment (NR-VQA) by exploiting a large pool of frozen pretrained models from diverse pretext tasks to obtain quality-sensitive representations. It introduces an ICID loss that enforces model-wise intra-consistency and sample-wise inter-divisibility, and uses the Davies-Bouldin Index (DBI) to select and weight pretrained models, enabling efficient fusion with few learnable parameters. The approach achieves competitive or state-of-the-art performance on NR-VQA benchmarks such as KoNViD-1k, LIVE-VQC, and YouTube-UGC, and shows strong cross-dataset generalization to LSVQ, all with reduced training cost. The method reduces reliance on MOS annotations and scales well with expanding pretrained-model libraries, offering practical benefits for real-world VQA deployment.

Abstract

Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video, e.g., content attractiveness, distortion type, motion pattern, and level. However, annotating the mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets and poses a significant obstacle for deep learning-based methods. In this paper, we propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks, enabling benefits for VQA from different aspects. Specifically, we extract features of videos from different pretrained models with frozen weights and integrate them to generate a representation. Since these models possess various fields of knowledge and are often trained with labels irrelevant to quality, we propose an Intra-Consistency and Inter-Divisibility (ICID) loss to impose constraints on features extracted by multiple pretrained models. The intra-consistency constraint ensures that features extracted by different pretrained models are in the same unified quality-aware latent space, while the inter-divisibility constraint introduces pseudo clusters based on the annotation of samples and tries to separate features of samples from different clusters. Furthermore, with a constantly growing number of pretrained models, it is crucial to determine which models to use and how to use them. To address this problem, we propose an efficient scheme to select suitable candidates: models with better clustering performance on VQA datasets are chosen as candidates. Extensive experiments demonstrate the effectiveness of the proposed method.
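
The selection scheme described above (prefer models whose features cluster well on a VQA dataset) can be illustrated with a minimal sketch. It is not the authors' implementation: the equal-width MOS binning, the assumed MOS range of 1 to 5, and the helper names `mos_to_pseudo_cluster`, `davies_bouldin_index`, and `rank_models` are all assumptions made for illustration. The Davies-Bouldin Index itself follows its standard definition, where a lower score means tighter, better-separated clusters.

```python
import math
from collections import defaultdict


def mos_to_pseudo_cluster(mos, n_clusters=6, mos_min=1.0, mos_max=5.0):
    """Map a MOS value to an equal-width bin index (assumed binning scheme)."""
    width = (mos_max - mos_min) / n_clusters
    idx = int((mos - mos_min) / width)
    # Clamp so that values on or beyond the range limits land in a valid bin.
    return min(max(idx, 0), n_clusters - 1)


def davies_bouldin_index(features, labels):
    """Standard Davies-Bouldin Index with Euclidean distance; lower is better."""
    groups = defaultdict(list)
    for feat, label in zip(features, labels):
        groups[label].append(feat)
    if len(groups) < 2:
        raise ValueError("DBI needs at least two non-empty clusters")

    centroids, scatter = {}, {}
    for label, pts in groups.items():
        dim = len(pts[0])
        center = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        centroids[label] = center
        # Mean distance of the cluster's points to its centroid.
        scatter[label] = sum(math.dist(p, center) for p in pts) / len(pts)

    total = 0.0
    for i in groups:
        ratios = []
        for j in groups:
            if j == i:
                continue
            gap = math.dist(centroids[i], centroids[j])
            # Coinciding centroids are treated as infinitely bad separation.
            ratios.append((scatter[i] + scatter[j]) / gap if gap > 0 else math.inf)
        total += max(ratios)
    return total / len(groups)


def rank_models(features_by_model, labels):
    """Return (model_name, dbi) pairs sorted from best (lowest DBI) to worst."""
    scores = {name: davies_bouldin_index(feats, labels)
              for name, feats in features_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

In use, each candidate model would supply one feature vector per video, the labels would come from `mos_to_pseudo_cluster` applied to the annotated MOS, and the top entries of `rank_models` would be kept as candidates. How the DBI scores are turned into fusion weights is not specified in this excerpt and is left out of the sketch.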

Paper Structure (22 sections, 7 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Video frames from the KoNViD-1k dataset that demonstrate a correlation between content/motion patterns and video quality. To identify potential reasons for poor perceptual video quality, specific factors corresponding to the labeled MOS are highlighted in italics.
  • Figure 2: Visualization of clustering results of features extracted by different pretrained models using t-SNE [DBLP:journals/jmlr/Maaten09]. Videos in KoNViD-1k [DBLP:conf/qomex/HosuHJLMSLS17] are used. The number of cluster centers is set to 6 according to the range of MOS values. DBI scores, which are introduced in detail in the section on DBI, measure the divergence of clustering results (the smaller, the better).
  • Figure 3: The pipeline of the proposed PTM-VQA. Features of input videos are extracted by pretrained models with frozen weights, transformed to the same dimension, and integrated to generate the final representation. In addition to the ordinary smooth $\mathcal{L}_1$ loss for regression, we add an ICID loss to ensure model-wise consistency and sample-wise divisibility.
  • Figure 4: Illustration of ICID loss. The figure shows examples of several triplets of triplet loss; two sets of intra-consistency between features extracted by four pretrained models; and one sample (with two triplets) for inter-divisibility.
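
The captions for Figures 3 and 4 refer to two generic building blocks: a smooth L1 regression loss and a triplet loss. The sketch below gives their standard textbook definitions only. The exact ICID formulation, including its intra-consistency term and the way triplets are sampled from MOS-based pseudo clusters, is not stated in this excerpt, so the function names, the `beta` and `margin` defaults, and the use of Euclidean distance are assumptions for illustration.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Standard smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss with Euclidean distance.

    In an inter-divisibility setting one would presumably take `positive`
    from the anchor's pseudo cluster and `negative` from a different one
    (an assumption; the sampling rule is not given in this excerpt).
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    # The loss is zero once the negative is farther than the positive by `margin`.
    return max(d_pos - d_neg + margin, 0.0)
```

For example, `smooth_l1(3.0, 3.5)` penalizes a half-point MOS error quadratically, while `triplet_margin_loss` returns zero for a triplet whose negative is already at least `margin` farther from the anchor than the positive.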