Point-DAE: Denoising Autoencoders for Self-supervised Point Cloud Learning

Yabin Zhang, Jiehong Lin, Ruihuang Li, Kui Jia, Lei Zhang

TL;DR

Point-DAE broadens self-supervised learning for 3D point clouds by framing pre-training as a denoising task and systematically studying 14 corruption types across three families: density/masking, noise, and affine transformation. A key contribution is the identification of global affine transformation as a corruption complementary to masking and, for Transformer backbones, a reconstruction decomposition that separates detailed local patch recovery from rough global shape reconstruction to alleviate position leakage. The method is validated on diverse downstream tasks (classification, part segmentation, robustness testing, few-shot learning, and 3D object detection) with multiple backbones, showing consistent improvements over strong baselines. Overall, Point-DAE demonstrates that combining global affine distortions with local masking yields robust, transferable representations from unlabeled 3D data, with potential applicability to other modalities.

Abstract

Masked autoencoders have demonstrated their effectiveness in self-supervised point cloud learning. Considering that masking is a kind of corruption, in this work we explore a more general denoising autoencoder for point cloud learning (Point-DAE) by investigating more types of corruption beyond masking. Specifically, we degrade the point cloud with certain corruptions as input, and learn an encoder-decoder model to reconstruct the original point cloud from its corrupted version. Three corruption families (i.e., density/masking, noise, and affine transformation) and a total of fourteen corruption types are investigated with traditional non-Transformer encoders. Besides the popular masking corruption, we identify another effective corruption family, i.e., affine transformation. The affine transformation disturbs all points globally, which is complementary to the masking corruption, where some local regions are dropped. We also validate the effectiveness of the affine transformation corruption with Transformer backbones, where we decompose the reconstruction of the complete point cloud into the reconstruction of detailed local patches and of the rough global shape, alleviating the position leakage problem in the reconstruction. Extensive experiments on object classification, few-shot learning, robustness testing, part segmentation, and 3D object detection validate the effectiveness of the proposed method. The code is available at https://github.com/YBZh/Point-DAE.
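The corrupt-then-reconstruct setup described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the helper names `drop_local` and `chamfer_distance`, the nearest-to-a-random-seed dropping rule, and the squared-distance, mean-reduced form of the Chamfer Distance are all assumptions made for illustration. The paper's actual corruption details are given in its Supplementary Material and its code repository.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).

    Assumption: squared Euclidean distances averaged in both directions.
    Other variants (unsquared, summed) are also common.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def drop_local(points, drop_ratio=0.5, rng=None):
    """Hypothetical local masking: drop the points nearest to a random seed."""
    rng = np.random.default_rng() if rng is None else rng
    seed = points[rng.integers(len(points))]
    dist = np.linalg.norm(points - seed, axis=1)
    n_keep = len(points) - int(round(drop_ratio * len(points)))
    # Sort by distance to the seed, farthest first, and keep the farthest n_keep.
    keep = np.argsort(dist)[::-1][:n_keep]
    return points[keep]


rng = np.random.default_rng(0)
clean = rng.normal(size=(256, 3))          # stand-in for a clean point cloud
corrupted = drop_local(clean, 0.5, rng)    # encoder input
# A denoising autoencoder would be trained so that decoder(encoder(corrupted))
# has a small Chamfer Distance to `clean`; here we only measure the corruption.
corruption_cd = chamfer_distance(corrupted, clean)
```

The dense pairwise distance matrix keeps the sketch short but uses O(N·M) memory, so it is only suitable for small point sets.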

Paper Structure

This paper contains 12 sections, 18 equations, 11 figures, and 15 tables.

Figures (11)

  • Figure 1: Visualizations of masking (e.g., Drop-Local) and affine transformation (e.g., Affine) corruptions and the corresponding Chamfer Distance (CD) to the clean input, where the reported CD values are averaged over the ShapeNet training set. ACC denotes the classification accuracy on the downstream ScanObjectNN dataset, as detailed in the paper's table of results for the different corruptions.
  • Figure 2: Illustration of the $14$ corruptions studied in this work, which can be classified into three corruption families, i.e., density/masking, noise, and affine transformation. Please refer to the Supplementary Material for more detailed implementation of these corruptions.
  • Figure 3: Transformation matrices corresponding to different sub-transformations of the affine transformation family. The parameters $\theta_x, \theta_y, \theta_z \in [-\pi, \pi]$ determine the rotation angles around the X, Y, and Z axes, respectively, in the Rotate corruption. Rotate-Z is a special case of the Rotate transformation, where the point cloud is rotated around the Z axis only, with $\theta_x = \theta_y = 0$ and $\theta_z \in [-\pi, \pi]$. The parameters $\tau_x, \tau_y, \tau_z \in \mathbb{R}$ determine the translation magnitude along the X, Y, and Z axes, respectively, in the Translate corruption. For the Reflect transformation, we set $r_x, r_y, r_z \in \{-1, 1\}$, and reflect the point cloud along the X/Y/Z axis with $r_x/r_y/r_z = -1$. The parameters $s_{xy}, s_{xz}, s_{yx}, s_{yz}, s_{zx}, s_{zy} \in \mathbb{R}$ determine the shear magnitude in the Shear corruption, and $s_x, s_y, s_z \in \mathbb{R}^+$ determine the scale intensity in the Scale corruption. The full Affine corruption is obtained by combining these sub-transformations, with its transformation matrix given by the product of the matrices of the sub-transformations. We conduct detailed analyses of these hyper-parameters in the Supplementary Material.
  • Figure 4: An overview of Point-DAE with non-Transformer backbones (upper part) and Transformer backbones (lower part), where we use a toy rotation operation to represent the affine transformation for visualization.
  • Figure 5: Loss curves of (a) $\mathcal{L}_{global}$ and (b) $\mathcal{L}_{local}$ with different learning objectives.
  • ...and 6 more figures
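
The sub-transformations listed in the Figure 3 caption can be sketched as matrices in NumPy. This is a minimal illustration and not the paper's code. The 4x4 homogeneous representation, the function names, the Z-Y-X rotation order, the placement of the shear coefficients, and the order in which the matrices are multiplied are all assumptions. The parameter sampling ranges are left to the caller, since the paper analyzes them in its Supplementary Material.

```python
import numpy as np


def rotate(theta_x=0.0, theta_y=0.0, theta_z=0.0):
    """Rotation about the X, Y, and Z axes (angles in radians)."""
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    m = np.eye(4)
    m[:3, :3] = rot_z @ rot_y @ rot_x  # assumed composition order
    return m


def translate(tau_x=0.0, tau_y=0.0, tau_z=0.0):
    m = np.eye(4)
    m[:3, 3] = [tau_x, tau_y, tau_z]
    return m


def reflect(r_x=1, r_y=1, r_z=1):
    """Each r is +1 (keep) or -1 (flip the sign of that coordinate)."""
    return np.diag([r_x, r_y, r_z, 1.0])


def shear(s_xy=0.0, s_xz=0.0, s_yx=0.0, s_yz=0.0, s_zx=0.0, s_zy=0.0):
    m = np.eye(4)
    m[0, 1], m[0, 2] = s_xy, s_xz
    m[1, 0], m[1, 2] = s_yx, s_yz
    m[2, 0], m[2, 1] = s_zx, s_zy
    return m


def scale(s_x=1.0, s_y=1.0, s_z=1.0):
    return np.diag([s_x, s_y, s_z, 1.0])


def apply_affine(points, matrix):
    """Apply a 4x4 homogeneous matrix to an (N, 3) array of points."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    return (homo @ matrix.T)[:, :3]


pts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# Rotate-Z is the special case with theta_x = theta_y = 0.
rotated_z = apply_affine(pts, rotate(theta_z=np.pi / 2))
# A combined corruption: scale first, then rotate, then translate.
full = translate(0.1, 0.0, 0.0) @ rotate(theta_z=np.pi / 2) @ scale(2.0, 2.0, 2.0)
corrupted = apply_affine(pts, full)
```

Because matrix multiplication is not commutative, a different multiplication order generally gives a different corruption, so the order used here should be read as one possible choice.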