Modular Blind Video Quality Assessment

Wen Wen; Mu Li; Yabin Zhang; Yiting Liao; Junlin Li; Li Zhang; Kede Ma

TL;DR

The paper addresses blind video quality assessment (BVQA) under varying spatial resolutions and frame rates by introducing a modular architecture consisting of a base quality predictor, a spatial rectifier, and a temporal rectifier. During training, a dropout-based strategy encourages the base predictor to operate independently, so that quality changes can be attributed to spatial or temporal factors. The model outputs a base score $q_b$, a spatially rectified score $q_s$, a temporally rectified score $q_t$, and a final jointly rectified score $q_{st}$ obtained by aggregating the two rectifiers. It achieves superior or competitive results across 14 PGC/UGC datasets and enables analysis of the spatial and temporal complexity of those datasets. The approach demonstrates strong cross-dataset generalization and provides a flexible framework for extending BVQA to additional video attributes, with practical relevance for streaming and user-generated content platforms.

Abstract

Blind video quality assessment (BVQA) plays a pivotal role in evaluating and improving the viewing experience of end-users across a wide range of video-based platforms and services. Contemporary deep learning-based models primarily analyze video content in an aggressively subsampled format, while being blind to the impact of the actual spatial resolution and frame rate on video quality. In this paper, we propose a modular BVQA model and a method of training it to improve its modularity. Our model comprises a base quality predictor, a spatial rectifier, and a temporal rectifier, which respond to the effects of visual content and distortion, spatial resolution, and frame rate on video quality, respectively. During training, the spatial and temporal rectifiers are dropped out with some probabilities so that the base quality predictor functions as a standalone BVQA model, which should perform better still when equipped with the rectifiers. Extensive experiments on both professionally-generated content and user-generated content video databases show that our quality model achieves superior or comparable performance to current methods. Additionally, the modularity of our model offers an opportunity to analyze existing video quality databases in terms of their spatial and temporal complexity.
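The rectifier dropout described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_training_output`, the probabilities `p_s` and `p_t`, and the choice to drop the two rectifiers independently of each other are all hypothetical. The sketch only shows which of the four scores ($q_b$, $q_s$, $q_t$, $q_{st}$) would be supervised for a given training sample.

```python
import random


def sample_training_output(p_s, p_t, rng):
    """Pick which score is used for one training sample.

    p_s, p_t: hypothetical probabilities of dropping the spatial and
    temporal rectifier; independent dropping is an assumption.
    rng: any object with a random() method returning a float in [0, 1).
    """
    drop_s = rng.random() < p_s  # drop the spatial rectifier?
    drop_t = rng.random() < p_t  # drop the temporal rectifier?

    if drop_s and drop_t:
        # Both rectifiers dropped: the base predictor must stand alone.
        return "q_b"
    if drop_t:
        # Only the spatial rectifier remains.
        return "q_s"
    if drop_s:
        # Only the temporal rectifier remains.
        return "q_t"
    # Both rectifiers kept: the fully rectified score.
    return "q_st"


if __name__ == "__main__":
    rng = random.Random(0)
    counts = {"q_b": 0, "q_s": 0, "q_t": 0, "q_st": 0}
    for _ in range(10000):
        counts[sample_training_output(0.1, 0.1, rng)] += 1
    print(counts)
```

Because the base predictor is sometimes trained with no rectifier attached, it cannot rely on the rectifiers to carry content and distortion information, which is what allows the rectified scores to be read as resolution-specific and frame-rate-specific corrections.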

Paper Structure (16 sections, 7 equations, 3 figures, 7 tables)

Introduction
Related Work
BVQA Models for PGC Content
BVQA Models for UGC Content
Proposed Method
Base Quality Predictor
Spatial Rectifier
Temporal Rectifier
Module Aggregation
Experiment
Experimental Setups
Results on PGC Datasets
Results on UGC Datasets
Ablation Studies
Conclusion and Discussion
...and 1 more section

Figures (3)

Figure 1: Conventional ways of pre-processing videos in the spatial view. (a) Videos with the same content but different spatial resolutions taken from the Waterloo-IVC-4K dataset [li2019avc]. (b) Resizing without maintaining the aspect ratio leads to geometric distortions of structures and textures. (c) Aspect ratio-preserving resizing and cropping results in almost identical inputs of fixed size. (d) Cropping from videos at the actual spatial resolution reduces the field of view with limited content coverage.
Figure 2: Two videos from the LIVE-YT-HFR dataset [madhusudana2021subjective] with identical content and duration but different frame rates. When the subsampling rate is proportional to the frame rate, the remaining frames are identical.
Figure 3: System diagram of our modular BVQA model. The base quality predictor takes a sparse set of spatially downsampled key frames as input, and generates a base quality score denoted by $q_b$. The spatial rectifier employs Laplacian pyramids derived from the key frames at their actual spatial resolution, and computes a scaling parameter $\alpha_s$ and a shift parameter $\beta_s$ to rectify the base quality score. The temporal rectifier leverages features from the video chunks centered around the key frames at the actual frame rate to compute another scaling parameter $\alpha_t$ and shift parameter $\beta_t$ for quality rectification.
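The scale-and-shift rectification in the Figure 3 caption can be illustrated with a short sketch. The affine form $q = \alpha q_b + \beta$ is the natural reading of "scaling parameter" and "shift parameter"; the rule used here to combine both rectifiers (geometric mean of the scales, arithmetic mean of the shifts) is an assumption, since this page does not state the aggregation formula. The function names and numeric values are hypothetical.

```python
import math


def rectify(q_b, alpha, beta):
    """Scale-and-shift rectification of the base quality score q_b."""
    return alpha * q_b + beta


def aggregate(q_b, alpha_s, beta_s, alpha_t, beta_t):
    """Combine the spatial and temporal rectifiers into one score.

    Assumed form: geometric mean of the two scales and arithmetic mean
    of the two shifts. Requires alpha_s * alpha_t >= 0.
    """
    alpha_st = math.sqrt(alpha_s * alpha_t)
    beta_st = 0.5 * (beta_s + beta_t)
    return alpha_st * q_b + beta_st


if __name__ == "__main__":
    q_b = 3.0                      # base quality score (made-up value)
    alpha_s, beta_s = 4.0, 1.0     # spatial scale and shift (made-up)
    alpha_t, beta_t = 1.0, -1.0    # temporal scale and shift (made-up)

    q_s = rectify(q_b, alpha_s, beta_s)
    q_t = rectify(q_b, alpha_t, beta_t)
    q_st = aggregate(q_b, alpha_s, beta_s, alpha_t, beta_t)
    print(q_s, q_t, q_st)
```

With a scale of 1 and a shift of 0, a rectifier leaves $q_b$ unchanged, so under this assumed rule a dropped rectifier can be treated as the identity correction.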
