Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment

Ziyu Shan; Yujie Zhang; Qi Yang; Haichen Yang; Yiling Xu; Jenq-Neng Hwang; Xiaozhong Xu; Shan Liu

TL;DR

The paper addresses the data scarcity challenge in NR-PCQA by presenting CoPA, a projection-based contrastive pre-training framework that learns quality-aware representations from unlabeled data. It generates anchors via local patch mixing on 2D projections and optimizes content-wise and distortion-wise contrasts with a momentum encoder, producing a robust encoder $\mathcal{F}$. In the downstream task, CoPA fine-tunes with labeled data using multi-view projections and a semantic-guided cross-attention fusion with a 2D backbone $\mathcal{G}$, guided by a final regression loss $\mathcal{L}_{fine} = \alpha \mathcal{L}_{mse} + (1-\alpha) \mathcal{L}_{rank}$. Empirical results on LS-PCQA, SJTU-PCQA, and WPC show that CoPA achieves state-of-the-art NR-PCQA performance and improves generalization, while also benefiting existing projection-based NR-PCQA models.
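The weighted combination of the two loss terms can be illustrated with a minimal sketch. The summary gives only the overall form of the fine-tuning loss, so the hinge-style pairwise rank term, the function names, and the default `alpha=0.5` below are assumptions made for illustration, not the authors' definitions.

```python
def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth quality scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_rank_loss(pred, target):
    """Hinge-style pairwise rank loss (an assumed form, not taken from the paper).

    A pair (i, j) is penalized only when the predicted ordering of the two
    scores disagrees with the ground-truth ordering.
    """
    total, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            if target[i] == target[j]:
                continue  # tied labels carry no ordering information
            sign = 1.0 if target[i] > target[j] else -1.0
            total += max(0.0, -sign * (pred[i] - pred[j]))
            pairs += 1
    return total / pairs if pairs else 0.0


def fine_tuning_loss(pred, target, alpha=0.5):
    """L_fine = alpha * L_mse + (1 - alpha) * L_rank; alpha=0.5 is a placeholder."""
    return alpha * mse_loss(pred, target) + (1.0 - alpha) * pairwise_rank_loss(pred, target)
```

With `alpha` close to 1 the objective is dominated by score accuracy, while smaller values put more weight on preserving the relative ordering of samples.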

Abstract

No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without an available reference, and it has improved substantially through the use of deep neural networks. However, learning-based NR-PCQA methods suffer from the scarcity of labeled data and usually generalize suboptimally. To address this problem, we propose a novel contrastive pre-training framework tailored for PCQA (CoPA), which enables the pre-trained model to learn quality-aware representations from unlabeled data. To obtain anchors in the representation space, we project point clouds with different distortions into images and randomly mix their local patches to form mixed images with multiple distortions. Utilizing the generated anchors, we constrain the pre-training process via a quality-aware contrastive loss, following the philosophy that perceptual quality is closely related to both content and distortion. Furthermore, in the model fine-tuning stage, we propose a semantic-guided multi-view fusion module to effectively integrate the features of projected images from multiple perspectives. Extensive experiments show that our method outperforms the state-of-the-art PCQA methods on popular benchmarks. Further investigations demonstrate that CoPA can also benefit existing learning-based PCQA models.
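The anchor-generation idea (randomly mixing local patches of projections that share content but differ in distortion) can be sketched as follows. This is a simplified illustration under stated assumptions: images are plain 2D grids of values, patches lie on a regular non-overlapping grid, each patch is drawn uniformly at random from the candidate images, and `mix_patches` is a hypothetical name. The paper's actual patch layout and sampling scheme are not specified in this summary.

```python
import random


def mix_patches(images, patch_size, rng=None):
    """Build a mixed image by picking each patch from a randomly chosen source.

    images: list of equally sized 2D grids (lists of rows), e.g. projections
            of the same point cloud under different distortions.
    Returns the mixed grid and a map of which source index filled each patch.
    """
    rng = rng or random.Random()
    height, width = len(images[0]), len(images[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must divide both image dimensions")

    mixed = [[None] * width for _ in range(height)]
    source_map = []
    for top in range(0, height, patch_size):
        row_sources = []
        for left in range(0, width, patch_size):
            k = rng.randrange(len(images))  # source image for this patch
            row_sources.append(k)
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    mixed[y][x] = images[k][y][x]
        source_map.append(row_sources)
    return mixed, source_map
```

The returned `source_map` records which distortion contributed each patch, which is the kind of bookkeeping a distortion-aware contrastive objective would need in order to tell positives from negatives.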

Paper Structure (15 sections, 9 equations, 4 figures, 4 tables)

Introduction
Related Works
  No-Reference Point Cloud Quality Assessment
  Contrastive Learning
Method
  Overview
  Contrastive Pre-Training
  Fine-Tuning with Labeled Data
  Quality Regression and Loss Function
Experiments
  Datasets and Evaluation Metrics
  Implementation Details
  Comparison with State-of-the-art Methods
  Ablation Studies
Conclusion

Figures (4)

Figure 1: Illustration of our contrastive pre-training framework (CoPA). CoPA first generates an anchor by randomly patch-mixing the projected images of a point cloud under different distortions, and then pre-trains the model by pulling positive samples closer to the anchor in the representation space while pushing distortion-wise and content-wise negative samples apart.
Figure 2: Framework of the proposed method. The framework mainly consists of two stages: (a) Contrastive pre-training. Unlabeled point clouds are projected into single-view images, and anchors are generated by local patch mixing. A single-view image encoder $\mathcal{F}$ is pre-trained by pulling the defined positive samples to the anchors in the representation space and pushing the negative samples apart. (b) Fine-tuning with labeled data. The labeled point cloud is projected into multi-view images encoded by the pre-trained encoder $\mathcal{F}$. Then, the multi-view images are stitched to extract semantic features through a 2D backbone $\mathcal{G}$, which guides the fusion of multi-view images using the cross-attention mechanism. Finally, the quality score is regressed by the fully-connected layers.
Figure 3: t-SNE embedding of the representation spaces of PQA-Net, GPA-Net, our model without pre-training, and our complete model on the testing set of SJTU-PCQA. The scattered points are color- and shape-encoded according to distortion type and content.
Figure 4: PLCCs of the NR-PCQA methods with less labeled data on SJTU-PCQA. Our pre-trained model outperforms the compared methods by a large margin when the training data is limited.
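The semantic-guided fusion described in the Figure 2(b) caption can be pictured with a minimal single-head cross-attention sketch, in which a semantic feature acts as the query and the per-view features act as keys and values. Several simplifications here are assumptions rather than details from the paper: there are no learned projection matrices, the semantic feature is a single vector, and `fuse_views` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(semantic_query, view_features):
    """Fuse per-view feature vectors with scaled dot-product attention.

    semantic_query: feature vector standing in for the output of the 2D backbone.
    view_features:  one feature vector per projected view (keys and values).
    Returns the fused vector and the attention weight assigned to each view.
    """
    dim = len(semantic_query)
    scores = [
        sum(q * k for q, k in zip(semantic_query, view)) / math.sqrt(dim)
        for view in view_features
    ]
    weights = softmax(scores)
    fused = [
        sum(w * view[d] for w, view in zip(weights, view_features))
        for d in range(dim)
    ]
    return fused, weights
```

Views whose features align more strongly with the semantic query receive larger weights, so the fused representation is not a plain average over perspectives.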
