NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References

Qiang Qu, Yiran Shen, Xiaoming Chen, Yuk Ying Chung, Weidong Cai, Tongliang Liu

TL;DR

NVS-SQA addresses the lack of dense references and scarce human labels in neurally synthesized scene quality assessment by introducing a no-reference, self-supervised framework. It learns perceptual quality representations through NSS-specific contrastive pair preparation and a multi-branch guidance scheme (IQA, VQA, REP) implemented in AdaptiSceneNet, followed by a linear mapper to human scores. The approach demonstrates strong cross-dataset generalization, outperforming numerous no-reference methods and rivaling several full-reference metrics across Fieldwork, LLFF, and Lab datasets, with robust performance in extreme conditions and efficient inference. The work also provides an open-source benchmark and tools to catalyze future research in NSS quality learning and evaluation.

Abstract

Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, an NSS quality assessment method that learns no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the "same instance, similar representation" assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
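The three correlation metrics quoted in the abstract (SRCC, PLCC, KRCC) compare a method's predicted quality scores with human scores. The following is a minimal pure-Python sketch of how such coefficients are commonly computed, not the authors' evaluation code. The function names and the toy scores are hypothetical, KRCC is computed here as Kendall's tau-b (an assumption about the variant), and the sketch omits the nonlinear score fitting that quality assessment studies often apply before PLCC.

```python
from math import sqrt


def plcc(x, y):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def krcc(x, y):
    """Kendall rank correlation (tau-b variant, assumed here)."""
    concordant = discordant = tied_x = tied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from every count
            if dx == 0:
                tied_x += 1
            elif dy == 0:
                tied_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + tied_x) * (concordant + discordant + tied_y))
    return (concordant - discordant) / denom


# Hypothetical toy data: model predictions vs. human opinion scores.
predicted = [0.2, 0.4, 0.5, 0.9]
human = [1.0, 2.5, 2.0, 4.5]
print(srcc(predicted, human), plcc(predicted, human), krcc(predicted, human))
```

SRCC and KRCC depend only on the ordering of scenes, so they reward a metric that ranks scenes as humans do, while PLCC also rewards a linear relationship between the two score scales.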

Paper Structure

This paper contains 20 sections, 7 equations, 9 figures, 9 tables, 2 algorithms.

Figures (9)

  • Figure 1: Which synthesized scene is better (top or bottom)? The existing quality assessment methods display deviations from human judgment, including full-reference image (PSNR, SSIM, LPIPS) and video (VMAF, FovVideoVDP) quality assessment methods, no-reference image (BRISQUE, Re-IQA) and video (Video-BLIINDS, DOVER) quality assessment methods, and light-field quality assessment methods (ALAS-DADS, LFACon). Uniquely, the proposed method mirrors human subjective evaluations without references, trained with self-supervised learning.
  • Figure 2: Why no-reference? Full-reference quality assessment methods (e.g., PSNR, SSIM) face a dilemma between the unlimited number of views in NSS and the limited availability of recorded references, whereas no-reference methods address this challenge.
  • Figure 3: Why self-supervised learning in NSS quality assessment? Traditional learning-based quality assessment methods (left) require costly retraining for different domains (e.g., datasets, quality assessment protocols) and can easily overfit in the case of limited human perceptual labels. In contrast, self-supervised learning-based methods (right) learn generalized quality representations once from unlabeled data, allowing them to easily adapt to new domains without overfitting.
  • Figure 4: Overview of the proposed learning framework. The framework consists of two main stages: (1) the self-supervised quality representation learning stage, where AdaptiSceneNet learns quality representations from unlabeled NSS through contrastive pair preparation and multi-branch guidance adaptation (each described in its own section of the paper), and (2) the perceptual quality estimation stage, where the pretrained AdaptiSceneNet estimates perceptual quality by mapping the learned representations to human scores.
  • Figure 5: Four clips from the same NSS at different quality levels (zoom in for a clearer view).
  • ...and 4 more figures
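
The second stage in the Figure 4 caption, mapping learned representations to human scores, can be illustrated with a small linear mapper fitted on frozen features. This is only a sketch under stated assumptions: the feature vectors below are hypothetical stand-ins rather than AdaptiSceneNet outputs, and the least-squares fit by gradient descent, the function names, and the hyperparameters are illustrative choices, not the procedure reported in the paper.

```python
def predict(weights, bias, feature):
    """Linear mapper: quality score = w . feature + b."""
    return sum(w * f for w, f in zip(weights, feature)) + bias


def fit_linear_mapper(features, scores, lr=0.1, epochs=3000):
    """Fit weights and bias by batch gradient descent on mean squared error.

    `features` stands in for frozen representations from a pretrained
    encoder (hypothetical here); the encoder itself is never updated.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for feature, score in zip(features, scores):
            err = predict(weights, bias, feature) - score
            for k in range(dim):
                grad_w[k] += err * feature[k]
            grad_b += err
        for k in range(dim):
            weights[k] -= lr * grad_w[k] / n
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical 2-D representations and human scores generated by
# score = 2 * f0 - 1 * f1 + 3, so the fit should recover these values.
toy_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
toy_scores = [2.0 * f[0] - 1.0 * f[1] + 3.0 for f in toy_features]

weights, bias = fit_linear_mapper(toy_features, toy_scores)
print(weights, bias, predict(weights, bias, [0.3, 0.7]))
```

Because only the mapper's few parameters are fitted to human labels, a design like this needs far fewer labeled scenes than training a full network, which is consistent with the motivation given in the Figure 3 caption.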