ReLaX-VQA: Residual Fragment and Layer Stack Extraction for Enhancing Video Quality Assessment
Xinyi Wang, Angeliki Katsenou, David Bull
TL;DR
ReLaX-VQA tackles No-Reference Video Quality Assessment (NR-VQA) for diverse User-Generated Content by combining selective spatio-temporal fragment sampling with layer-stacked deep features from ResNet-50 and ViT. The framework comprises three modules: Spatio-Temporal Fragment Sampling, which extracts salient fragments (RFs/MF) from frame differences and optical flow; DNN Feature Extraction with multi-layer fusion; and a lightweight MLP regressor trained with a composite MAE and Rank loss. Empirically, it achieves state-of-the-art or competitive performance across four NR-VQA benchmarks and the large-scale LSVQ dataset, especially when fine-tuned, and generalizes well across resolutions. The work shows that focusing on high-variability spatio-temporal regions and combining local and global feature representations yields robust NR-VQA results. Open-source code and pretrained models are provided for broader adoption.
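As an illustration of what a composite MAE and Rank loss can look like, the following is a minimal NumPy sketch. It is not taken from the ReLaX-VQA code: the pairwise hinge form of the rank term, the function names, and the weights `alpha` and `beta` are all assumptions made for this example.

```python
import numpy as np

def mae_loss(pred, target):
    """Mean absolute error between predicted and ground-truth scores."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))

def pairwise_rank_loss(pred, target, margin=0.0):
    """Hinge penalty on pairs whose predicted order disagrees with the
    ground-truth order (an assumed form of the rank term)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    dp = pred[:, None] - pred[None, :]      # predicted score differences
    dt = target[:, None] - target[None, :]  # ground-truth score differences
    hinge = np.maximum(0.0, margin - np.sign(dt) * dp)
    mask = dt != 0                          # skip ties and self-pairs
    if not mask.any():
        return 0.0
    return float(hinge[mask].mean())

def composite_loss(pred, target, alpha=1.0, beta=1.0):
    """Weighted sum of the MAE and rank terms; alpha and beta are hypothetical."""
    return alpha * mae_loss(pred, target) + beta * pairwise_rank_loss(pred, target)

if __name__ == "__main__":
    mos = [1.0, 2.0, 3.0]
    print(composite_loss([1.1, 2.2, 2.9], mos))  # small: order is preserved
    print(composite_loss([3.0, 2.0, 1.0], mos))  # larger: order is reversed
```

The MAE term keeps predictions close to the subjective scores in absolute terms, while the rank term penalizes pairs of videos that are ordered incorrectly, which is the property that rank-based metrics such as SRCC measure.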
Abstract
With the rapid growth of User-Generated Content (UGC) exchanged between users and sharing platforms, the need for video quality assessment in the wild is increasingly evident. UGC is typically acquired using consumer devices and undergoes multiple rounds of compression (transcoding) before reaching the end user. Therefore, traditional quality metrics that employ the original content as a reference are not suitable. In this paper, we propose ReLaX-VQA, a novel No-Reference Video Quality Assessment (NR-VQA) model that aims to address the challenges of evaluating the quality of diverse video content without reference to the original uncompressed videos. ReLaX-VQA uses frame differences to intelligently select spatio-temporal fragments, together with different expressions of the spatial features associated with the sampled frames. These are then used to better capture spatial and temporal variabilities in the quality of neighbouring frames. Furthermore, the model enhances abstraction by employing layer-stacking techniques in deep neural network features from Residual Networks and Vision Transformers. Extensive testing across four UGC datasets demonstrates that ReLaX-VQA consistently outperforms existing NR-VQA methods, achieving an average SRCC of 0.8658 and PLCC of 0.8873. Open-source code and trained models that will facilitate further research and applications of NR-VQA can be found at https://github.com/xinyiW915/ReLaX-VQA.
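The idea of using frame differences to pick out high-variability regions can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `select_residual_fragment`, the patch size, the `top_k` value, and the ranking of patches by mean absolute difference are all hypothetical choices for this example, and optical flow is omitted.

```python
import numpy as np

def select_residual_fragment(prev_frame, next_frame, patch=16, top_k=4):
    """Stitch the top_k most-changed patches of a frame difference into
    one small fragment image. Expects two 2-D grayscale frames."""
    prev = np.asarray(prev_frame, dtype=np.float32)
    nxt = np.asarray(next_frame, dtype=np.float32)
    residual = np.abs(nxt - prev)  # absolute frame difference

    # Crop so the frame divides evenly into a grid of patches.
    rows, cols = residual.shape[0] // patch, residual.shape[1] // patch
    residual = residual[: rows * patch, : cols * patch]

    # View the residual as a (rows, cols, patch, patch) grid of blocks.
    blocks = residual.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    scores = blocks.mean(axis=(2, 3)).ravel()  # mean change per patch

    # Keep the highest-scoring patches, then restore raster order.
    order = np.sort(np.argsort(scores)[::-1][:top_k])
    selected = blocks.reshape(rows * cols, patch, patch)[order]

    # Tile the selected patches into a near-square fragment (zero-padded).
    side = int(np.ceil(np.sqrt(len(order))))
    fragment = np.zeros((side * patch, side * patch), dtype=np.float32)
    for n, blk in enumerate(selected):
        r, c = divmod(n, side)
        fragment[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = blk
    return fragment, order

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random((64, 64), dtype=np.float32)
    b = a.copy()
    b[16:32, 32:48] += 1.0  # simulate motion in a single patch
    frag, idx = select_residual_fragment(a, b, patch=16, top_k=4)
    print(frag.shape, idx)
```

In a full pipeline, a fragment of this kind would be passed to a feature extractor in place of the whole frame, so the model spends its capacity on the regions where quality is most likely to vary between neighbouring frames.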
