Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform

Chunghyun Park, Seungwook Kim, Jaesik Park, Minsu Cho

TL;DR

This work tackles the problem of establishing dense 3D semantic correspondences between shapes under arbitrary rotations. It introduces RIST, a self-supervised framework whose SO(3)-equivariant encoder produces a global descriptor $Z\in\mathbb{R}^{C\times3}$ together with dynamic, per-point SO(3)-invariant local shape transforms $f_{\theta_i}: \mathbb{R}^{C\times3} \to \mathbb{R}^{C'\times3}$ that map $Z$ to local descriptors, enabling self- and cross-reconstruction with a rotation-robust decoder. By supervising these reconstructions, RIST maps semantically corresponding points to similar local descriptors, yielding state-of-the-art performance on rotated 3D part label transfer and 3D keypoint transfer on ShapeNetPart, ScanObjectNN, and KeypointNet. The method demonstrates strong rotation robustness and generalization to real-world data, offering a practical pathway to dense 3D annotation and downstream 3D understanding tasks. Key innovations include the dynamic, per-point local shape transform and the use of SO(3)-equivariant and SO(3)-invariant representations to provide rotational robustness.
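
The two symmetry properties named above can be checked numerically on a toy example. The sketch below is not the RIST architecture: the linear "encoder", the matrices `W` and `A`, and all function names are hypothetical stand-ins, chosen only so that $Z$ is SO(3)-equivariant and each $f_{\theta_i}$ is SO(3)-invariant. It also assumes, for illustration, that $f_{\theta_i}$ acts as a $C'\times C$ matrix on the channel dimension of $Z$, which is consistent with the stated shapes but is not confirmed by this summary.

```python
import numpy as np

def random_rotation(rng):
    """Sample a 3x3 rotation matrix (orthogonal, det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the column signs of the factorization
    if np.linalg.det(q) < 0:      # flip one axis so the result lies in SO(3)
        q[:, 0] = -q[:, 0]
    return q

def toy_global_descriptor(points, W):
    """Hypothetical equivariant encoder: Z = W @ points has shape (C, 3).
    Rotating the points (points @ R.T) rotates Z the same way (Z @ R.T)."""
    return W @ points

def toy_local_transforms(points, Z, A, c_out):
    """Hypothetical per-point transforms theta_i of shape (c_out, C).
    They depend only on inner products <x_i, Z_c>, which rotations preserve."""
    s = points @ Z.T                      # (N, C) rotation-invariant scalars
    theta = np.tanh(s @ A.T)              # (N, c_out * C)
    return theta.reshape(len(points), c_out, Z.shape[0])

def local_descriptors(theta, Z):
    """Apply each theta_i to Z along the channel axis: result is (N, c_out, 3)."""
    return np.einsum("nkc,cd->nkd", theta, Z)

rng = np.random.default_rng(0)
N, C, C_OUT = 64, 8, 4
points = rng.normal(size=(N, 3))
W = rng.normal(size=(C, N)) / N           # toy encoder weights (assumption)
A = rng.normal(size=(C_OUT * C, C)) / C   # toy transform weights (assumption)
R = random_rotation(rng)

Z = toy_global_descriptor(points, W)
theta = toy_local_transforms(points, Z, A, C_OUT)

rotated = points @ R.T
Z_rot = toy_global_descriptor(rotated, W)
theta_rot = toy_local_transforms(rotated, Z_rot, A, C_OUT)

print("Z equivariant:      ", np.allclose(Z_rot, Z @ R.T))
print("f_theta invariant:  ", np.allclose(theta_rot, theta))
print("local desc. rotate: ", np.allclose(local_descriptors(theta_rot, Z_rot),
                                          local_descriptors(theta, Z) @ R.T))
```

In this toy setting the per-point transforms are unchanged by the rotation while the global descriptor rotates with the input, which is the division of labor the TL;DR describes.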

Abstract

Establishing accurate 3D correspondences between shapes stands as a pivotal challenge with profound implications for computer vision and robotics. However, existing self-supervised methods for this problem assume perfect input shape alignment, restricting their real-world applicability. In this work, we introduce a novel self-supervised Rotation-Invariant 3D correspondence learner with Local Shape Transform, dubbed RIST, that learns to establish dense correspondences between shapes even under challenging intra-class variations and arbitrary orientations. Specifically, RIST learns to dynamically formulate an SO(3)-invariant local shape transform for each point, which maps the SO(3)-equivariant global shape descriptor of the input shape to a local shape descriptor. These local shape descriptors are provided as inputs to our decoder to facilitate point cloud self- and cross-reconstruction. Our proposed self-supervised training pipeline encourages semantically corresponding points from different shapes to be mapped to similar local shape descriptors, enabling RIST to establish dense point-wise correspondences. RIST demonstrates state-of-the-art performances on 3D part label transfer and semantic keypoint transfer given arbitrarily rotated point cloud pairs, outperforming existing methods by significant margins.
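
The abstract does not say which reconstruction error is penalized. A common choice for comparing unordered point sets is the Chamfer distance, so the sketch below assumes it purely for illustration. The names `chamfer_distance`, `recon_self`, and `recon_cross` are hypothetical, the two reconstructions are stand-in arrays rather than decoder outputs, and the equal weighting of the two terms is also an assumption.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean squared distance from each point to its nearest neighbor in the
    other set, summed over both directions."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
target = rng.normal(size=(128, 3))

# Stand-ins for decoder outputs (assumption): noisy copies of the target.
# The cross-reconstruction is also shuffled, since point order carries no
# meaning for a set-to-set loss.
recon_self = target + 0.01 * rng.normal(size=target.shape)
recon_cross = target[rng.permutation(len(target))] + 0.02 * rng.normal(size=target.shape)

# Equal weighting of the self- and cross-reconstruction terms is an assumption.
loss = chamfer_distance(recon_self, target) + chamfer_distance(recon_cross, target)
print(f"toy reconstruction loss: {loss:.6f}")
```

Because the distance is defined over sets, it needs no ground-truth correspondences, which is what makes a reconstruction objective of this kind usable for self-supervision.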

Paper Structure (26 sections, 2 equations, 13 figures, 11 tables, 1 algorithm)

Figures (13)

  • Figure 1: Semantic correspondence between rotated shapes. We visualize the semantic correspondence results of the previous SOTA [cheng2021learning] and ours, given two randomly rotated airplanes from the ShapeNetPart dataset [shapenetpart]. Green and red lines indicate correct and incorrect matches, respectively. For each method, 100 points are randomly selected from the source shape (yellow) for correspondence visualization. Our method predicts SO(3)-invariant correspondences, showing superior accuracy and robustness compared to the previous SOTA [cheng2021learning].
  • Figure 2: Overview: Self-supervised training of RIST. The input point clouds are independently encoded into an SO(3)-equivariant global shape descriptor $\mathbf{Z}$ and dynamic SO(3)-invariant point-wise local shape transforms $\{f_{\theta_i}\}$. The local shape transforms map the global shape descriptor to local shape descriptors by infusing local semantics and geometry; these descriptors are used as inputs to the decoder for self-reconstruction. For cross-reconstruction, we apply the local shape transforms formulated from another point cloud to reconstruct the point cloud, ensuring that the local shape descriptors capture generalizable local semantics and geometry. We supervise RIST by penalizing errors in self- and cross-reconstruction. At inference, we use the local shape transforms to obtain local shape descriptors, from which dense correspondences are identified.
  • Figure 3: Qualitative results of part label transfer on the ShapeNetPart dataset [shapenetpart]. We visualize the label transfer results obtained via the learned correspondences of each method, alongside the ground-truth labels of the targets. Note that the input shapes were arbitrarily rotated at evaluation, with different rotations for the source and the targets of each row, but have been aligned in the figure for better visibility of the part label transfer results. RIST consistently outperforms CPAE [cheng2021learning], and its results closely resemble the ground truth.
  • Figure 4: Qualitative results of part label transfer on ScanObjectNN [scanobjectnn]. Note that both source and target point clouds were arbitrarily rotated at evaluation, but have been aligned in the figure for better visibility of the part label transfer results. The results show that RIST reasonably predicts the semantic correspondences between arbitrarily rotated, partial real-world point clouds.
  • Figure 5: Percentage of Correct Keypoints (PCK) for the 12 categories of the KeypointNet dataset [you2020keypointnet], with and without rotation augmentations during training. RIST consistently outperforms previous approaches on all classes and thresholds in both settings.
  • ...and 8 more figures
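
The captions above refer to label transfer via learned correspondences and to the PCK metric. The sketch below shows one plausible way these steps fit together, under stated assumptions: each point already has a rotation-invariant descriptor flattened to a vector, correspondences are taken as Euclidean nearest neighbors in descriptor space, and PCK counts a keypoint as correct when it lies within a given distance threshold of the ground truth. All function names are hypothetical, and the threshold convention used in the paper is not given in this summary.

```python
import numpy as np

def nearest_neighbor_matches(query_desc, ref_desc):
    """For each query descriptor (Q, D), return the index of the closest
    reference descriptor (R, D) under Euclidean distance."""
    d2 = ((query_desc[:, None, :] - ref_desc[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def transfer_labels(ref_labels, matches):
    """Give each query point the label of its matched reference point."""
    return ref_labels[matches]

def pck(pred_xyz, gt_xyz, threshold):
    """Fraction of predicted keypoints within `threshold` of the ground truth."""
    dist = np.linalg.norm(pred_xyz - gt_xyz, axis=1)
    return float((dist <= threshold).mean())

# Toy data (assumption): 2 labeled source points, 3 unlabeled target points.
source_desc = np.array([[0.0, 0.0], [1.0, 1.0]])
source_labels = np.array([5, 7])
target_desc = np.array([[0.9, 1.1], [0.1, -0.1], [1.2, 0.8]])

matches = nearest_neighbor_matches(target_desc, source_desc)
print("matches:           ", matches)
print("transferred labels:", transfer_labels(source_labels, matches))

# Toy keypoint check: one prediction is 0.05 away, the other 0.5 away.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.05], [1.0, 0.0, 0.5]])
print("PCK@0.1:           ", pck(pred, gt, threshold=0.1))
```

The pairwise distance matrix here is dense, which is fine for a few thousand points per shape; larger clouds would typically call for a spatial index or batched computation.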