Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment
Ziyu Shan, Yujie Zhang, Qi Yang, Haichen Yang, Yiling Xu, Jenq-Neng Hwang, Xiaozhong Xu, Shan Liu
TL;DR
The paper addresses the data scarcity challenge in NR-PCQA by presenting CoPA, a projection-based contrastive pre-training framework that learns quality-aware representations from unlabeled data. It generates anchors via local patch mixing on 2D projections and optimizes content-wise and distortion-wise contrasts with a momentum encoder, producing a robust encoder $\mathcal{F}$. In the downstream task, CoPA fine-tunes with labeled data using multi-view projections and a semantic-guided cross-attention fusion with a 2D backbone $\mathcal{G}$, guided by a final regression loss $\mathcal{L}_{fine} = \alpha \mathcal{L}_{mse} + (1-\alpha) \mathcal{L}_{rank}$. Empirical results on LS-PCQA, SJTU-PCQA, and WPC show that CoPA achieves state-of-the-art NR-PCQA performance and improves generalization, while also benefiting existing projection-based NR-PCQA models.
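As a rough illustration of how the fine-tuning objective combines its two terms, the following pure-Python sketch evaluates a weighted sum of an MSE term and a ranking term. The summary does not specify the form of the ranking term, so the pairwise hinge used here is an assumption, as are the function names and the default `alpha=0.5`.

```python
from itertools import combinations


def mse_loss(pred, target):
    """Mean squared error between predicted and subjective quality scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def rank_loss(pred, target):
    """Assumed pairwise hinge ranking term (not taken from the paper).

    A pair contributes a penalty only when the predicted order of the two
    samples contradicts the order of their subjective scores.
    """
    pairs = list(combinations(range(len(pred)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        # sign is +1, 0, or -1 depending on the ground-truth ordering.
        sign = (target[i] > target[j]) - (target[i] < target[j])
        total += max(0.0, -sign * (pred[i] - pred[j]))
    return total / len(pairs)


def fine_tune_loss(pred, target, alpha=0.5):
    """Weighted sum: alpha * L_mse + (1 - alpha) * L_rank."""
    return alpha * mse_loss(pred, target) + (1 - alpha) * rank_loss(pred, target)


if __name__ == "__main__":
    predicted = [3.0, 2.0, 1.0]   # hypothetical model outputs
    subjective = [1.0, 2.0, 3.0]  # hypothetical mean opinion scores
    print(fine_tune_loss(predicted, subjective, alpha=0.5))
```

Setting `alpha=1.0` reduces the objective to plain MSE regression, while smaller values put more weight on ordering samples correctly relative to each other.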
Abstract
No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without an available reference, and it has improved substantially through the use of deep neural networks. However, learning-based NR-PCQA methods suffer from the scarcity of labeled data and usually generalize suboptimally. To address this problem, we propose a novel contrastive pre-training framework tailored for PCQA (CoPA), which enables the pre-trained model to learn quality-aware representations from unlabeled data. To obtain anchors in the representation space, we project point clouds with different distortions into images and randomly mix their local patches to form mixed images with multiple distortions. Using the generated anchors, we constrain the pre-training process via a quality-aware contrastive loss, following the philosophy that perceptual quality is closely related to both content and distortion. Furthermore, in the model fine-tuning stage, we propose a semantic-guided multi-view fusion module to effectively integrate the features of projected images from multiple perspectives. Extensive experiments show that our method outperforms state-of-the-art PCQA methods on popular benchmarks. Further investigations demonstrate that CoPA can also benefit existing learning-based PCQA models.
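The anchor-generation step described above (randomly mixing local patches of projected images) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the two-image setup, the square patch grid, the 50/50 selection probability, and the returned mixing fraction are all assumptions, and images are represented as plain nested lists.

```python
import random


def mix_patches(img_a, img_b, patch=2, rng=None):
    """Build a mixed image by choosing each patch from img_a or img_b.

    img_a and img_b are equally sized 2D lists standing in for projections
    of point clouds with different distortions. Returns the mixed image and
    the fraction of patches taken from img_b (hypothetical bookkeeping).
    """
    rng = rng or random.Random()
    height, width = len(img_a), len(img_a[0])
    mixed = [row[:] for row in img_a]  # start from a copy of img_a
    from_b = 0
    total = 0
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            total += 1
            if rng.random() < 0.5:
                from_b += 1
                # Overwrite this patch with the corresponding patch of img_b.
                for y in range(top, min(top + patch, height)):
                    mixed[y][left:left + patch] = img_b[y][left:left + patch]
    return mixed, from_b / total


if __name__ == "__main__":
    a = [[0] * 4 for _ in range(4)]  # stand-in for a projection with distortion A
    b = [[1] * 4 for _ in range(4)]  # stand-in for a projection with distortion B
    mixed, frac_b = mix_patches(a, b, patch=2, rng=random.Random(0))
    for row in mixed:
        print(row)
    print("fraction of patches from b:", frac_b)
```

Passing a seeded `random.Random` keeps the mixing reproducible, which is convenient when checking which patches each source contributed.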
