An intuitive multi-frequency feature representation for SO(3)-equivariant networks

Dongwon Son, Jaehyung Kim, Sanghyeon Son, Beomjoon Kim

TL;DR

This paper tackles the limited expressivity of SO(3)-equivariant networks by introducing a Frequency-based Equivariant Feature Representation (FER) that maps a 3D point to a high-dimensional, rotation-equivariant feature. The core idea is to construct a mapping $D: SO(3) \to SO(n)$ via skew-symmetric generators $\vec{J}=[J_1,J_2,J_3]$ satisfying specific commutator and periodicity conditions, yielding $D(R)=\exp(\theta \hat{\omega} \cdot \vec{J}) \in SO(n)$, whose spectrum encodes multiple frequencies up to $\lfloor (n-1)/2 \rfloor$. By concatenating multi-frequency FER features with Vector Neurons and feeding them into standard 3D backbones like PointNet and DGCNN, the approach achieves state-of-the-art performance among equivariant methods across tasks including shape completion, shape compression, normal estimation, registration, and classification/segmentation, with notable improvements in high-frequency detail capture under rotations. The work provides theoretical guarantees of equivariance, demonstrates practical gains on diverse 3D vision benchmarks, and offers reproducibility resources, advancing robust 3D understanding in rotational settings.

Abstract

The usage of 3D vision algorithms, such as shape reconstruction, remains limited because they require inputs to be at a fixed canonical rotation. Recently, a simple equivariant network, Vector Neuron (VN), has been proposed that can be easily used with state-of-the-art 3D neural network (NN) architectures. However, its performance is limited because it is designed to use only three-dimensional features, which are insufficient to capture the details present in 3D data. In this paper, we introduce an equivariant feature representation for mapping a 3D point to a high-dimensional feature space. Our feature can discern multiple frequencies present in 3D data, which is the key to designing an expressive feature for 3D vision tasks. Our representation can be used as an input to VNs, and the results demonstrate that with our feature representation, VN captures more details, overcoming the limitation raised in its original paper.

Paper Structure (36 sections, 24 theorems, 85 equations, 9 figures, 9 tables, 2 algorithms)

This paper contains 36 sections, 24 theorems, 85 equations, 9 figures, 9 tables, 2 algorithms.

Key Result

Theorem 1

If $J_i \in\mathbb{R}^{n\times n}$, $\forall i\in\{1,2,3\}$, satisfies $-J_i=J_i^T$, $[J_1, J_2] = J_3$, $[J_2, J_3] = J_1$, $[J_3, J_1] = J_2$, where $[A, B] = AB - BA$, and $\exp(2m\pi J_i) = I_{n \times n}$, $\forall m\in \mathbb{Z}$, where $\mathbb{Z}$ is the set of integers, then $D(R)=\exp(\theta \hat{\omega} \cdot \vec{J})$ ...
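
The hypotheses of Theorem 1 can be checked numerically in the smallest case, $n=3$. The sketch below is an illustration rather than the authors' code: it assumes the standard cross-product matrices as the generators $J_1, J_2, J_3$, and uses a small hand-written `expm` helper (truncated Taylor series with scaling and squaring) so that only NumPy is needed. The names `expm`, `comm`, `check_conditions`, and `D` are hypothetical.

```python
import numpy as np

def expm(A, terms=20, squarings=8):
    """Matrix exponential: truncated Taylor series plus scaling and squaring."""
    A = np.asarray(A, dtype=float) / (2 ** squarings)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k          # A^k / k!
        result = result + term
    for _ in range(squarings):
        result = result @ result     # undo the initial scaling
    return result

def comm(A, B):
    """Commutator [A, B] = AB - BA."""
    return A @ B - B @ A

# Assumption: n = 3 generators chosen as the cross-product (hat) matrices.
J = np.array([
    [[0, 0, 0], [0, 0, -1], [0, 1, 0]],
    [[0, 0, 1], [0, 0, 0], [-1, 0, 0]],
    [[0, -1, 0], [1, 0, 0], [0, 0, 0]],
], dtype=float)

def check_conditions(J, tol=1e-9):
    """Evaluate the three hypotheses of the theorem for a generator triple."""
    J1, J2, J3 = J
    n = J1.shape[0]
    return {
        "skew_symmetric": all(np.allclose(-Ji, Ji.T, atol=tol) for Ji in J),
        "commutators": (np.allclose(comm(J1, J2), J3, atol=tol)
                        and np.allclose(comm(J2, J3), J1, atol=tol)
                        and np.allclose(comm(J3, J1), J2, atol=tol)),
        # Checked for m = 1 only; other integers m follow by taking powers.
        "periodic": all(np.allclose(expm(2 * np.pi * Ji), np.eye(n), atol=tol)
                        for Ji in J),
    }

def D(theta, omega):
    """D(R) = exp(theta * omega_hat . J) for angle theta and rotation axis omega."""
    omega = np.asarray(omega, dtype=float)
    omega = omega / np.linalg.norm(omega)
    return expm(theta * np.einsum("i,ijk->jk", omega, J))

if __name__ == "__main__":
    print(check_conditions(J))
    R = D(0.7, [1.0, 2.0, -0.5])
    print(np.allclose(R.T @ R, np.eye(3)), np.linalg.det(R))
```

For these generators the eigenvalues of each $J_i$ are $0$ and $\pm i$, so the highest frequency is $1 = \lfloor (3-1)/2 \rfloor$, which agrees with the bound quoted in the TL;DR. Higher-dimensional generators would be needed to obtain higher frequencies; constructing them is not attempted here.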

Figures (9)

  • Figure 1: EGAD (Morrison et al., 2020) meshes constructed from the embeddings given by different models based on OccNet (Mescheder et al., 2019) at canonical poses. As already noted in its original paper, VN-OccNet (3rd column), the VN version of OccNet, fails to capture the details present in the ground-truth shapes and does worse than OccNet (2nd column). With our feature representation (4th column), VN-OccNet qualitatively performs better than OccNet. Note that each of these shapes consists of multiple frequencies: in some parts of the object the shape changes abruptly, while in others it changes very smoothly.
  • Figure 2: Intuition behind our equivariant feature representation, $\psi$, which for illustration maps a point in 2D to 3D (i.e., $n=3$). (Left) The basis axis in 2D is $\hat{z}=[0,1]$, and $\hat{u} = R^{\hat{z}}(\hat{u})\hat{z}$, with $\theta$ as its amount of rotation from $\hat{z}$. Our $D$ is constructed so that it defines the same amount of rotation, $\theta$, but from a basis in the 3D space, which in this case is chosen to be $\hat{e}=[0,0,1]$. The feature representation of $\hat{u}$ is given by $\psi(\hat{u})=D(R^{\hat{z}}(\hat{u}))\hat{e}$. When $\theta$ changes, both $\hat{u}$ and $\psi(\hat{u})$ rotate by the same amount. The treatment of magnitude is omitted for brevity.
  • Figure 3: Reconstructions of meshes from point cloud inputs across three models: the original OccNet (Mescheder et al., 2019) (bottom), VN-OccNet (Deng et al., 2021) (middle), and our proposed model (top).
  • Figure 3: Registration results on the ShapeNet dataset. The metric is Chamfer Distance. Bold is the best performance.
  • Figure 4: The left graph shows the volumetric IoU of OccNet, VN-OccNet, and FER-VN-OccNet across complexity levels in the EGAD training set. We apply rotational augmentation during both training and test time. The right graph shows FER-VN-OccNet's IoU improvement over VN-OccNet.
  • ...and 4 more figures
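
The 2D-to-3D intuition in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: $D$ is assumed to rotate about the x-axis of the 3D feature space (the caption only fixes the basis $\hat{e}=[0,0,1]$, not the rotation plane), magnitude is ignored as in the caption, and the function names `rot2d`, `angle_from_z`, `D`, and `psi` are hypothetical.

```python
import numpy as np

Z_HAT = np.array([0.0, 1.0])        # basis axis in 2D, from the caption
E_HAT = np.array([0.0, 0.0, 1.0])   # basis axis in the 3D feature space

def rot2d(theta):
    """Counter-clockwise 2D rotation by theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def angle_from_z(u_hat):
    """Angle theta such that u_hat = rot2d(theta) @ Z_HAT = [-sin, cos]."""
    return np.arctan2(-u_hat[0], u_hat[1])

def D(theta):
    """Assumed lift of a 2D rotation: the same angle, about the 3D x-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def psi(u_hat):
    """Feature of a unit 2D direction: psi(u) = D(R^z(u)) @ e_hat."""
    return D(angle_from_z(u_hat)) @ E_HAT

if __name__ == "__main__":
    u = rot2d(0.4) @ Z_HAT
    alpha = 1.1
    lhs = psi(rot2d(alpha) @ u)     # rotate the input, then featurize
    rhs = D(alpha) @ psi(u)         # featurize, then rotate the feature
    print(np.allclose(lhs, rhs))
```

The check at the bottom is the equivariance property described in the caption: rotating $\hat{u}$ by some angle rotates $\psi(\hat{u})$ by the same angle in the feature space.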

Theorems & Definitions (40)

  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Theorem 2
  • Proposition 3
  • Theorem 3
  • Theorem 3
  • Lemma 1
  • proof
  • Lemma 2
  • ...and 30 more