PAME: Self-Supervised Masked Autoencoder for No-Reference Point Cloud Quality Assessment

Ziyu Shan, Yujie Zhang, Qi Yang, Haichen Yang, Yiling Xu, Shan Liu

TL;DR

This work proposes PAME, a self-supervised pre-training framework based on masked autoencoders that helps the model learn useful representations without labels. The method outperforms state-of-the-art NR-PCQA methods on popular benchmarks in both prediction accuracy and generalizability.

Abstract

No-reference point cloud quality assessment (NR-PCQA) aims to automatically predict the perceptual quality of point clouds without a reference, and deep learning-based models have achieved remarkable performance on this task. However, these data-driven models suffer from the scarcity of labeled data and perform unsatisfactorily in cross-dataset evaluations. To address this problem, we propose a self-supervised pre-training framework using masked autoencoders (PAME) to help the model learn useful representations without labels. Specifically, after projecting point clouds into images, PAME employs dual-branch autoencoders that take masked distorted images as input and reconstruct the original patches of both the reference and the distorted images. In this manner, the two branches separately learn content-aware features and distortion-aware features from the projected images. Furthermore, in the fine-tuning stage, the learned content-aware features guide the fusion of the point cloud quality features extracted from different perspectives. Extensive experiments show that our method outperforms state-of-the-art NR-PCQA methods on popular benchmarks in terms of prediction accuracy and generalizability.
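
The dual-branch reconstruction objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`patchify`, `random_mask`, `masked_mse`), the 32x32 toy image, the 8-pixel patch size, the Gaussian noise used as a stand-in distortion, and the mean-squared-error loss are all assumptions made for illustration. The 50% masking ratio follows the Figure 1 caption, and the pairing of targets with branches (reference patches for the content-aware branch, distorted patches for the distortion-aware branch) is an assumed reading of the abstract. The encoders and decoders are replaced by placeholder predictions.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, ratio, rng):
    """Boolean array of length num_patches, True where a patch is hidden."""
    num_masked = int(round(num_patches * ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)

# Toy projected images: a reference view and a noisy (distorted) copy of it.
reference = rng.random((32, 32, 3))
distorted = np.clip(reference + 0.1 * rng.standard_normal(reference.shape), 0.0, 1.0)

ref_patches = patchify(reference, 8)
dis_patches = patchify(distorted, 8)

# Hide half of the distorted patches; only the visible ones reach the encoders.
mask = random_mask(len(dis_patches), 0.5, rng)
visible = dis_patches[~mask]

# Placeholders standing in for the outputs of the two decoders.
pred_content = np.zeros_like(ref_patches)
pred_distortion = np.zeros_like(dis_patches)

# Assumed pairing: content-aware branch targets the reference patches,
# distortion-aware branch targets the original distorted patches.
loss_content = masked_mse(pred_content, ref_patches, mask)
loss_distortion = masked_mse(pred_distortion, dis_patches, mask)
print(visible.shape, loss_content, loss_distortion)
```

Both losses are evaluated at the same masked positions, so the two branches see identical visible input and differ only in what they are asked to reconstruct.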

Paper Structure (12 sections, 8 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Framework of the proposed pre-training method (PAME). An unlabeled point cloud is first projected into images, which are then partitioned into patches, 50% of which are masked. The visible (unmasked) patches are embedded by convolutional layers, combined with positional embeddings, and fed to two vision transformers. The encoded features are then decoded to predict the masked patches and the corresponding patches of the projected images rendered from the reference point cloud. The content-aware and distortion-aware branches are marked in BLUE and RED, respectively.
  • Figure 2: Framework of the fine-tuning stage. The labeled point cloud is rendered and patchified, and the patches are then embedded. After being encoded by $\mathcal{F}$ and $\mathcal{G}$, the features in the content-aware branch are max-pooled to guide the fusion of the distortion-aware features through a cross-attention mechanism.
  • Figure 3: PLCCs of the NR-PCQA methods with less labeled data on SJTU-PCQA.
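
The content-guided fusion in the Figure 2 caption can be sketched as single-head cross-attention in NumPy. This is a rough illustration under stated assumptions, not the paper's architecture: the function name `content_guided_fusion`, the choice to max-pool over all tokens of all views into one query vector, the random projection matrices `w_q`, `w_k`, `w_v`, and the toy sizes (6 views, 16 tokens per view, 32 dimensions) are all hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def content_guided_fusion(content_tokens, distortion_tokens, w_q, w_k, w_v):
    """Fuse distortion-aware tokens using one content-derived query.

    content_tokens, distortion_tokens: arrays of shape (views, tokens, dim).
    Returns the fused (dim,) feature and the attention weights.
    """
    dim = content_tokens.shape[-1]
    # Assumption: max-pool over every token of every view to get one guide vector.
    guide = content_tokens.reshape(-1, dim).max(axis=0)
    tokens = distortion_tokens.reshape(-1, dim)

    query = guide @ w_q      # (dim,)   from the content-aware branch
    keys = tokens @ w_k      # (views * tokens, dim) from the distortion-aware branch
    values = tokens @ w_v    # (views * tokens, dim)

    # Scaled dot-product attention with a single query.
    weights = softmax(keys @ query / np.sqrt(dim))
    return weights @ values, weights

rng = np.random.default_rng(0)
views, tokens_per_view, dim = 6, 16, 32
content = rng.standard_normal((views, tokens_per_view, dim))
distortion = rng.standard_normal((views, tokens_per_view, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, weights = content_guided_fusion(content, distortion, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Using the pooled content feature as the query means the attention weights decide how much each distortion-aware token from each view contributes to the fused quality feature.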