Towards Unified Video Quality Assessment

Chen Feng, Tianhao Peng, Fan Zhang, David Bull

TL;DR

Unified-VQA introduces a diagnostic Mixture-of-Experts for video quality assessment to overcome the interpretability and generalization gaps of monolithic VQA models. It uses three domain-specific perceptual experts, a SlowFast-based spatio-temporal aggregator, and a diagnostic head to output both a global quality score and a multi-dimensional artifact vector. The model is trained in three stages with expert-guided proxies, weak artifact supervision, and joint fine-tuning on subjective scores, and it outperforms more than 18 benchmark methods across 17 databases without per-dataset retraining. Results demonstrate strong VQA and artifact-detection performance, with explicit interpretability and robustness across HD, UHD, HDR, and HFR formats. The work supports practical deployment for real-time, multi-format streaming quality monitoring and opens avenues for further extension to VR and UGC content.

Abstract

Recent works in video quality assessment (VQA) typically employ monolithic models that predict a single quality score for each test video. These approaches cannot provide diagnostic, interpretable feedback, offering little insight into why the video quality is degraded. Most of them are also specialized, format-specific metrics rather than truly "generic" solutions, as they are designed to learn a compromised representation from disparate perceptual domains. To address these limitations, this paper proposes Unified-VQA, a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats by recasting generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem. Unified-VQA employs multiple "perceptual experts" dedicated to distinct perceptual domains. A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss, guided by the most suitable proxy metric for its domain. We also integrate a diagnostic multi-task head into this framework to generate a global quality score and an interpretable multi-dimensional artifact vector. The head is optimized using a weakly-supervised learning strategy that leverages the known properties of the large-scale training database generated for this work. With static model parameters (without retraining or fine-tuning), Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks across 17 databases containing diverse streaming artifacts in HD, UHD, HDR and HFR formats. This work represents an important step towards practical, actionable, and interpretable video quality assessment.
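The abstract states that each expert is optimized with a ranking-inspired loss guided by a proxy metric, but it does not give the loss itself. The sketch below is therefore only an illustration of the general idea, assuming a pairwise hinge formulation: the expert is penalised whenever it orders two videos differently from the proxy metric. The function name `pairwise_ranking_loss` and the `margin` parameter are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_ranking_loss(pred, proxy, margin=0.0):
    """Mean pairwise hinge loss between predicted and proxy scores.

    Assumption: this hinge form is a generic stand-in for a
    "ranking-inspired" loss, not the formulation used in the paper.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(pred)), 2):
        diff = proxy[i] - proxy[j]
        if diff == 0:
            continue  # the proxy expresses no preference for this pair
        sign = 1.0 if diff > 0 else -1.0
        # Zero when the prediction agrees with the proxy ordering by at
        # least `margin`; grows linearly as the ordering is violated.
        total += max(0.0, margin - sign * (pred[i] - pred[j]))
        count += 1
    return total / count if count else 0.0


proxy_scores = [0.9, 0.5, 0.1]             # hypothetical proxy metric values
agreeing_pred = [3.0, 2.0, 1.0]            # same ordering as the proxy
reversed_pred = [1.0, 2.0, 3.0]            # opposite ordering
print(pairwise_ranking_loss(agreeing_pred, proxy_scores))
print(pairwise_ranking_loss(reversed_pred, proxy_scores))
```

A loss of this kind depends only on the ordering (and, with a margin, the spacing) of the predictions, so each expert can follow a proxy metric whose numerical scale differs from that of the other proxies.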

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: (Top) The difference between the proposed Unified-VQA framework and existing VQA models in terms of interpretability and practicality. (Bottom) Radar plots show the superior performance of Unified-VQA when compared to three well-performing quality models across four groups of databases containing compression artifacts (in HD and UHD formats), motion distortions and color artifacts for the VQA task (in SROCC and PLCC), and when compared to three multi-artifact detectors (in F1-Score). A larger area indicates better overall performance.
  • Figure 2: The proposed Unified-VQA framework, which consists of a pre-trained feature extractor $f$, three lightweight perceptual expert networks (PEN), a spatio-temporal aggregator (STA) and a unified diagnostic head. It is trained using an expert-guided multi-task learning strategy.
  • Figure 3: Qualitative comparison between Unified-VQA (FR) and existing quality models, which shows that Unified-VQA aligns more closely with the human decisions for each content group.
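
Figure 1 reports VQA performance in SROCC and PLCC. As a reminder of what these two correlation measures capture, the following is a minimal pure-Python sketch with no external dependencies: PLCC measures linear agreement between predicted and subjective scores, while SROCC is the same statistic computed on ranks and so measures monotonic agreement. The helper names and the toy scores are illustrative assumptions, not values from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SROCC)."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [1.0, 2.0, 3.0, 4.0]        # toy model outputs
subjective = [1.0, 4.0, 9.0, 16.0]      # toy opinion scores, monotonic but nonlinear
print("SROCC:", spearman(predicted, subjective))
print("PLCC:", pearson(predicted, subjective))
```

In this toy case the ordering is preserved exactly, so SROCC is at its maximum while PLCC is lower because the relationship is not linear. Many VQA studies fit a nonlinear mapping between predictions and subjective scores before computing PLCC; that step is omitted here.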