Learning Perceptual Representations for Gaming NR-VQA with Multi-Task FR Signals
Yu-Chih Chen, Michael Wang, Chieh-Dun Wen, Kai-Siang Ma, Avinab Saha, Li-Heng Chen, Alan Bovik
TL;DR
This work tackles no-reference video quality assessment (NR-VQA) for gaming videos by introducing MTL-VQA, a multi-task pretraining framework that learns perceptual representations from multiple full-reference (FR) metrics, with no human labels needed during pretraining. A shared encoder is trained on several FR objectives with adaptive gradient weighting (MGDA/MinNormSolver), then frozen; NR-VQA predictions come from a lightweight support vector regressor (SVR) fitted on temporally pooled features. The approach shows strong label-efficient transfer across gaming datasets, including promising few-shot MOS calibration with as few as 100 labeled clips and competitive performance under PGC-to-UGC domain shift. This makes low-latency NR prediction practical for cloud-gaming QoE monitoring. Future work includes HUD-aware masking and temporal or artifact-aware auxiliary tasks to improve robustness.
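The final stage described above (frozen encoder, temporal pooling, lightweight SVR) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the per-frame features and MOS labels are synthetic stand-ins, mean pooling is an assumed choice of temporal pooling, and the helper name `temporal_mean_pool`, the feature dimension, and the SVR hyperparameters are all hypothetical.

```python
import numpy as np
from sklearn.svm import SVR


def temporal_mean_pool(frame_features):
    """Average per-frame features of shape (T, D) into one clip vector (D,)."""
    return np.asarray(frame_features, dtype=np.float64).mean(axis=0)


rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen-encoder outputs: 100 clips, each with a
# different number of frames and a 16-dimensional feature per frame.
# (Both numbers are assumptions chosen only for illustration.)
num_clips, feat_dim = 100, 16
clips = [
    rng.normal(size=(int(rng.integers(20, 40)), feat_dim))
    for _ in range(num_clips)
]

# One pooled feature vector per clip, regardless of clip length.
X = np.stack([temporal_mean_pool(c) for c in clips])

# Synthetic MOS-like targets; real labels would come from a subjective study.
mos = rng.uniform(1.0, 5.0, size=num_clips)

# Lightweight regressor on top of the frozen features.
# Kernel and C are assumed values, not taken from the paper.
regressor = SVR(kernel="rbf", C=1.0)
regressor.fit(X, mos)
predictions = regressor.predict(X)
```

Because the encoder stays frozen, only the small regressor is fitted on labeled clips, which is what makes calibration with around 100 labels plausible in this kind of setup.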
Abstract
No-reference video quality assessment (NR-VQA) for gaming videos is challenging because human-rated datasets are scarce and the content has distinctive characteristics, including fast motion, stylized graphics, and compression artifacts. We present MTL-VQA, a multi-task learning framework that uses full-reference (FR) metrics as supervisory signals during pretraining, so that perceptually meaningful features are learned without human labels. By jointly optimizing multiple FR objectives with adaptive task weighting, our approach learns shared representations that transfer effectively to NR-VQA. Experiments on gaming video datasets show that MTL-VQA performs competitively with state-of-the-art NR-VQA methods in both MOS-supervised and label-efficient/self-supervised settings.
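To make the adaptive task weighting concrete, the sketch below shows the closed-form min-norm step that MGDA-style solvers use for the special case of two tasks: it picks the convex combination of two task gradients with the smallest norm. It is an illustration under stated assumptions only. The framework is described as using several FR objectives, which would require the general solver, and the function names and plain-list gradient representation here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_two_tasks(g1, g2):
    """Return alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: every weighting gives the same result.
        return 0.5
    # Unconstrained minimizer is (g2 - g1).g2 / ||g1 - g2||^2.
    alpha = -dot(diff, g2) / denom
    # Clip to the simplex [0, 1].
    return min(1.0, max(0.0, alpha))


def combine_gradients(g1, g2):
    """Weight two task gradients by the min-norm coefficients."""
    alpha = min_norm_two_tasks(g1, g2)
    combined = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, combined


# Orthogonal gradients of equal norm receive equal weight.
alpha_orth, update_orth = combine_gradients([1.0, 0.0], [0.0, 1.0])

# Aligned gradients: the weight goes entirely to the smaller-norm gradient.
alpha_aligned, update_aligned = combine_gradients([1.0, 0.0], [3.0, 0.0])
```

In a training loop, `g1` and `g2` would be the gradients of two FR-metric losses with respect to the shared encoder parameters, and the combined vector would drive the encoder update.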
