Surface-Aware Distilled 3D Semantic Features

Lukas Uzolas, Elmar Eisemann, Petr Kellnhofer

TL;DR

The paper tackles robust 3D shape matching by addressing the intra-class ambiguity of neural features distilled from 2D foundation models. It introduces a surface-aware embedding that maps per-vertex base features to a hyperspherical space using a self-supervised contrastive loss $\mathcal{L}_c$ guided by geodesic distances $d_{n,a}$, complemented by a reconstruction loss $\mathcal{L}_r$ that preserves semantic content. Training requires only a small, unpaired set of meshes and yields a joint, one-shot-capable embedding that generalizes to unseen shapes without fine-tuning. Inference remains efficient because geodesic distances are needed only during training. The resulting features improve 3D correspondences, enable downstream tasks such as one-shot pose transfer, skinning weight regression, and 2D-to-3D texturing, and apply to diverse shape classes beyond humanoids and animals. Overall, the approach provides a practical, self-supervised way to adapt 2D foundation models for robust 3D shape analysis and manipulation with limited data.
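
The interplay of the two losses can be illustrated with a minimal sketch. This summary does not give the exact form of $\mathcal{L}_c$, so the code below swaps in a simple distance-matching stand-in: it penalizes the gap between the angular distance of two unit-normalized embeddings and the normalized geodesic distance $d_{n,a}$ between point $n$ and an anchor $a$. The function names, the normalization by the maximum geodesic distance, and the squared-error form are all assumptions made for illustration, not the authors' formulation.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angular_distance(a, b):
    """Angle between two unit vectors, rescaled to the range [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard acos against rounding error
    return math.acos(dot) / math.pi


def geodesic_guided_loss(embeddings, geodesic, anchor):
    """Hypothetical stand-in for L_c (not the paper's exact loss).

    Mean squared gap between the embedding-space angular distance and
    the normalized geodesic distance d_{n,a} from each point n to the
    anchor a.
    """
    s = [normalize(e) for e in embeddings]
    d_max = max(max(row) for row in geodesic)
    total, count = 0.0, 0
    for n in range(len(s)):
        if n == anchor:
            continue
        target = geodesic[n][anchor] / d_max
        total += (angular_distance(s[n], s[anchor]) - target) ** 2
        count += 1
    return total / count


def reconstruction_loss(decoded, base):
    """Mean squared error between decoded features and base features."""
    total, count = 0.0, 0
    for d_vec, f_vec in zip(decoded, base):
        for d, f in zip(d_vec, f_vec):
            total += (d - f) ** 2
            count += 1
    return total / count
```

Under these assumptions, two points that are far apart on the surface (for example a left and a right hand) are pushed apart on the hypersphere even when their base features are nearly identical, while the reconstruction term keeps the embedding decodable back to the base features. Geodesic distances appear only in the training loss, which is consistent with the statement that they are not needed at inference.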

Abstract

Many 3D tasks such as pose alignment, animation, motion transfer, and 3D reconstruction rely on establishing correspondences between 3D shapes. This challenge has recently been approached by pairwise matching of semantic features from pre-trained vision models. However, despite their power, these features struggle to differentiate instances of the same semantic class, such as "left hand" versus "right hand", which leads to substantial mapping errors. To solve this, we learn a surface-aware embedding space that is robust to these ambiguities while facilitating shared mapping for an entire family of 3D shapes. Importantly, our approach is self-supervised and requires only a small number of unpaired training meshes to infer features for new, possibly imperfect 3D shapes at test time. We achieve this by introducing a contrastive loss that preserves the semantic content of the features distilled from foundational models while disambiguating features located far apart on the shape's surface. We observe superior performance in correspondence matching benchmarks and enable downstream applications including 2D-to-3D and 3D-to-3D texture transfer, in-part segmentation, pose alignment, and motion transfer in low-data regimes. Unlike previous pairwise approaches, our solution constructs a joint embedding space, where both seen and unseen 3D shapes are implicitly aligned without further optimization. The code is available at https://graphics.tudelft.nl/SurfaceAware3DFeatures.

Paper Structure

This paper contains 55 sections, 10 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Overview of our method. We feed images of a 3D shape rendered from multiple viewpoints to a pre-trained 2D vision model and extract features that are then projected back onto surface points $\mathbf{p}_i$ and aggregated into per-point features $\mathbf{f}_i$ (Sec. \ref{sec:preliminaries}). Next, we pointwise embed the base features $\mathbf{f}_i$ into our surface-aware features $\mathbf{s}_i$ residing in a lower-dimensional space learned using our contrastive loss preserving geodesic distances $d_{i,j}$ and a reconstruction loss matching decoded features $\mathbf{\bar{f}}_i$ to $\mathbf{f}_i$ (Sec. \ref{sec:method}). The surface-aware features $\mathbf{s}_i$ serve as robust descriptors for correspondence matching (Sec. \ref{sec:experiments}) and as building blocks for many downstream applications (Sec. \ref{sec:applications}).
  • Figure 2: Two shapes (left) and PCA-based 2D projections of their aggregated Diff3F base features and our surface-aware features (right). Notice the separation of limbs in our result compared to Diff3F. Our features originate from the same encoder for both shapes. The animal legs appear merged along the sagittal plane due to limitations of the PCA projection, but they remain disambiguated in our feature space as demonstrated in Fig. \ref{fig:transfer_centroids}.
  • Figure 3: One-shot pose transfer using our features, Diff3F features, or geometric descriptors. MSE $\times10^{-4}$ is reported for human shapes.
  • Figure 4: Mean Squared Error of skinning weight regression ($\downarrow$ is better) and its distribution across the SMPL mesh surface.
  • Figure 5: Qualitative comparison on the SHREC'19 and TOSCA datasets with dense true correspondence labels provided by their authors. We show the source and target meshes with their ground truth correspondence labels (the two left-most columns) in comparison to correspondences computed using our surface-aware features (the fourth column) and Diff3F base features (the right-most column). We further highlight the correspondence error on the mesh surface (the third and the fifth columns). The error colormap is normalized per sample by the maximal error over both methods to keep the error scale comparable across columns but not across rows. Our surface-aware features notably improve separation of the limb instances.
  • ...and 12 more figures
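
The Figure 1 caption describes the surface-aware features $\mathbf{s}_i$ as descriptors for correspondence matching. One simple way such descriptors could be used is nearest-neighbour lookup by cosine similarity, sketched below. This summary does not state which matching rule the paper actually uses, so the rule itself and the names `cosine_similarity` and `match_points` are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def match_points(source_feats, target_feats):
    """For each source point, return the index of the most similar target point.

    Hypothetical matching rule: plain nearest neighbour in feature space.
    """
    matches = []
    for s in source_feats:
        sims = [cosine_similarity(s, t) for t in target_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append(best)
    return matches
```

Because the abstract describes a joint embedding space shared by all shapes, a lookup of this kind would need no per-pair optimization: features of any two shapes produced by the same encoder can be compared directly.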