Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform
Chunghyun Park, Seungwook Kim, Jaesik Park, Minsu Cho
TL;DR
This work tackles dense 3D semantic correspondence between shapes under arbitrary rotations. It introduces RIST, a self-supervised framework that uses an SO(3)-equivariant encoder to produce a global shape descriptor $Z \in \mathbb{R}^{C\times 3}$ and per-point, dynamic SO(3)-invariant local shape transforms $f_{\theta_i}: \mathbb{R}^{C\times 3} \to \mathbb{R}^{C'\times 3}$ that map $Z$ to local shape descriptors, enabling self- and cross-reconstruction with a rotation-robust decoder. Supervising these reconstructions encourages semantically corresponding points to receive similar local descriptors, which yields state-of-the-art performance on rotated 3D part label transfer and 3D keypoint transfer on ShapeNetPart, ScanObjectNN, and KeypointNet. The method is robust to rotation and generalizes to real-world data, offering a practical route to dense 3D annotation and downstream 3D understanding tasks. The key innovations are the dynamic, per-point local shape transform and the use of SO(3)-equivariant and SO(3)-invariant representations to guarantee rotational robustness.
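The interplay between an SO(3)-equivariant global descriptor and SO(3)-invariant per-point transforms can be checked numerically. The sketch below is a toy illustration only, not the RIST architecture: the hand-crafted encoder, the linear form of the transform (a $C'\times C$ matrix per point), and all names (`toy_equivariant_descriptor`, `toy_invariant_transforms`, `freqs`, `proj`) are assumptions introduced for this example. It shows that when $Z$ rotates with the input and the transforms do not change, the resulting local descriptors in $\mathbb{R}^{C'\times 3}$ rotate with the input as well.

```python
import numpy as np

def random_rotation(rng):
    # QR of a Gaussian matrix gives an orthogonal matrix; fix signs so det = +1.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def toy_equivariant_descriptor(points, freqs):
    # Hypothetical encoder: Z[c] = mean_i cos(freqs[c] * ||x_i||) * x_i.
    # The weights depend only on norms (rotation-invariant), so each row of Z
    # is a 3-vector that rotates together with the input points.
    norms = np.linalg.norm(points, axis=1)               # (N,)
    weights = np.cos(freqs[:, None] * norms[None, :])    # (C, N)
    return weights @ points / len(points)                # (C, 3)

def toy_invariant_transforms(points, Z, proj):
    # Hypothetical per-point transform: inner products <Z[c], x_i> are
    # unchanged by a joint rotation, so any function of them is invariant.
    s = points @ Z.T                                     # (N, C), invariant
    C = Z.shape[0]
    return np.tanh(s @ proj).reshape(len(points), -1, C) # (N, C_out, C)

def describe(points, freqs, proj):
    Z = toy_equivariant_descriptor(points, freqs)
    W = toy_invariant_transforms(points, Z, proj)
    V = W @ Z                                            # (N, C_out, 3) local descriptors
    return Z, W, V

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))       # toy point cloud, one point per row
R = random_rotation(rng)
freqs = np.linspace(0.5, 2.0, 4)         # C = 4 channels (assumed)
proj = rng.normal(size=(4, 8 * 4))       # C_out = 8 channels (assumed)

Z, W, V = describe(points, freqs, proj)
Z_r, W_r, V_r = describe(points @ R.T, freqs, proj)  # rotate every point by R

print("global descriptor equivariant:", np.allclose(Z_r, Z @ R.T))
print("local transforms invariant:   ", np.allclose(W_r, W))
print("local descriptors equivariant:", np.allclose(V_r, V @ R.T))
```

In this toy setting the three checks hold by construction, because the transform only ever sees rotation-invariant quantities. A learned model would obtain the same properties from its equivariant layers rather than from hand-written formulas.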
Abstract
Establishing accurate 3D correspondences between shapes stands as a pivotal challenge with profound implications for computer vision and robotics. However, existing self-supervised methods for this problem assume perfect input shape alignment, restricting their real-world applicability. In this work, we introduce a novel self-supervised Rotation-Invariant 3D correspondence learner with Local Shape Transform, dubbed RIST, that learns to establish dense correspondences between shapes even under challenging intra-class variations and arbitrary orientations. Specifically, RIST learns to dynamically formulate an SO(3)-invariant local shape transform for each point, which maps the SO(3)-equivariant global shape descriptor of the input shape to a local shape descriptor. These local shape descriptors are provided as inputs to our decoder to facilitate point cloud self- and cross-reconstruction. Our proposed self-supervised training pipeline encourages semantically corresponding points from different shapes to be mapped to similar local shape descriptors, enabling RIST to establish dense point-wise correspondences. RIST demonstrates state-of-the-art performance on 3D part label transfer and semantic keypoint transfer given arbitrarily rotated point cloud pairs, outperforming existing methods by significant margins.
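Once semantically corresponding points have similar descriptors, part label transfer can be reduced to a nearest-neighbor lookup. The following is a minimal sketch under stated assumptions, not the evaluation protocol of the paper: it assumes each point already has a rotation-invariant feature vector, uses cosine similarity as the matching score, and the function name `transfer_labels` and the toy features are hypothetical.

```python
import numpy as np

def transfer_labels(src_feats, src_labels, tgt_feats):
    """Assign each target point the label of its most similar source point.

    src_feats: (N, D) per-point features of the labeled source shape.
    src_labels: (N,) integer part labels of the source points.
    tgt_feats: (M, D) per-point features of the unlabeled target shape.
    Returns the (M,) transferred labels and the (M,) matched source indices.
    """
    # Cosine similarity (an assumed choice of matching score).
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    similarity = tgt @ src.T                 # (M, N)
    nearest = similarity.argmax(axis=1)      # best source match per target point
    return src_labels[nearest], nearest

# Toy data: four source points with distinct features and part labels.
src_feats = np.eye(4)
src_labels = np.array([0, 0, 1, 2])

# Target points are a shuffled, slightly perturbed copy of the source features.
perm = np.array([2, 0, 3, 1])
tgt_feats = src_feats[perm] + 0.05

tgt_labels, matches = transfer_labels(src_feats, src_labels, tgt_feats)
print(tgt_labels, matches)
```

Keypoint transfer follows the same pattern in the opposite direction: for each annotated source keypoint, the most similar target point is selected instead.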
