Adaptive Score Alignment Learning for Continual Perceptual Quality Assessment of 360-Degree Videos in Virtual Reality

Kanglei Zhou, Zikai Hao, Liyuan Wang, Xiaohui Liang

TL;DR

The paper tackles VR-VQA for 360-degree videos, where balancing alignment with human judgments and predictive precision is challenged by non-stationary content distributions. It introduces Adaptive Score Alignment Learning (ASAL), which fuses correlation and error losses and uses feature-space smoothing to improve generalization, and extends ASAL with adaptive memory replay for Continual VR-VQA under VR-device constraints. A key part of ASAL is a memory-efficient mechanism using key-frame extraction and a latent-space feature adapter to reconstruct full video features for replay, plus a re-parameterization technique to stabilize learning. A new benchmark with dataset splits and metrics demonstrates that ASAL yields substantial gains in both static joint training and dynamic continual learning settings, underscoring its practical viability for robust, continual perceptual quality assessment in VR environments.

Abstract

Virtual Reality Video Quality Assessment (VR-VQA) aims to evaluate the perceptual quality of 360-degree videos, which is crucial for ensuring a distortion-free user experience. Traditional VR-VQA methods trained on static datasets with limited distortion diversity struggle to balance correlation and precision. This becomes particularly critical when generalizing to diverse VR content and continually adapting to dynamic, evolving video distributions. To address these challenges, we propose a novel approach for assessing the perceptual quality of VR videos, Adaptive Score Alignment Learning (ASAL). ASAL integrates correlation loss with error loss to enhance alignment with human subjective ratings and precision in predicting perceptual quality. In particular, ASAL can naturally adapt to continually changing distributions through a feature space smoothing process that enhances generalization to unseen content. To further improve continual adaptation to dynamic VR environments, we extend ASAL with adaptive memory replay as a novel Continual Learning (CL) framework. Unlike traditional CL models, ASAL utilizes key frame extraction and feature adaptation to address the unique challenges of non-stationary variations under both the computation and storage restrictions of VR devices. We establish a comprehensive benchmark for VR-VQA and its CL counterpart, introducing new data splits and evaluation metrics. Our experiments demonstrate that ASAL outperforms recent strong baseline models, achieving overall correlation gains of up to 4.78% in the static joint training setting and 12.19% in the dynamic CL setting on various datasets. This validates the effectiveness of ASAL in addressing the inherent challenges of VR-VQA. Our code is available at https://github.com/ZhouKanglei/ASAL_CVQA.
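
To illustrate why the abstract pairs a correlation loss with an error loss, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a Pearson-based correlation term, a mean-squared-error term, and a convex weighting through a hypothetical `weight` parameter; the exact loss forms and weighting used by ASAL are not given in this summary, and the function names are illustrative only.

```python
import numpy as np


def correlation_loss(pred, target, eps=1e-8):
    """Assumed correlation term: 1 - Pearson correlation (0 when perfectly aligned)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    p = pred - pred.mean()
    t = target - target.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum()) + eps
    return 1.0 - float((p * t).sum() / denom)


def error_loss(pred, target):
    """Assumed error term: mean squared error between predicted and subjective scores."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))


def combined_loss(pred, target, weight=0.5):
    """Hypothetical convex combination of the two terms (weight is an assumption)."""
    return weight * correlation_loss(pred, target) + (1.0 - weight) * error_loss(pred, target)


# Predictions that rank every video correctly but are offset by one score unit.
subjective = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]

corr_term = correlation_loss(shifted, subjective)  # close to 0: ordering is perfect
err_term = error_loss(shifted, subjective)         # 1.0: every score is off by one
total = combined_loss(shifted, subjective)
```

In this toy case the correlation term alone would treat the shifted predictions as essentially perfect, while the error term exposes the constant offset. Optimizing both is one way to pursue alignment with human rankings and numerical precision at the same time.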

Paper Structure

This paper contains 39 sections, 15 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of our ASAL method: (a) At the end of task $t$, representative samples are selected and stored in the memory bank. Given the limited storage capacity of VR devices, we further extract key frames from these samples to ensure efficient memory utilization. (b) During the current task $t$, the current session data and the replay data are processed alternately. For replay, a mini-batch is drawn from the memory bank to extract relevant features. A feature adapter is then used to reconstruct the video sequence before regression. The regressor employs a re-parameterization technique to enhance robustness, and the outputs are optimized using two loss terms to ensure alignment with human assessments.
  • Figure 2: Scatter plots of normalized spatial versus normalized temporal information across different data splits: (a-e) represent sessions 1 to 5, while (f) corresponds to the base session split used for pre-training.
  • Figure 3: Performance comparison plots of the joint training model with varying loss weights: (a) SRCC ($\uparrow$), (b) RL2E ($\downarrow$).
  • Figure 4: Line plots for the impact of representative samples per session on the overall performance metrics ($\mathrm{SRCC_{ove}}$ and $\mathrm{RL2E_{ove}}$).
  • Figure 5: Performance comparison heatmaps of our model with varying key frames: (a) $\mathrm{SRCC_{ove}}$ ($\uparrow$), (b) $\mathrm{RL2E_{ove}}$ ($\downarrow$).
  • ...and 4 more figures