Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features

Niladri Shekhar Dutt, Sanjeev Muralikrishnan, Niloy J. Mitra

TL;DR

Diff3F is a zero-shot semantic descriptor for untextured 3D shapes, obtained by distilling diffusion-based image features from multi-view renders guided by depth and normal cues. The method textures the untextured renders via ControlNet conditioning, extracts diffusion and DINO features, and unprojects them back to 3D, aggregating across many views to produce per-vertex semantic descriptors. Correspondences are computed with cosine similarity or via a Functional Maps pipeline, enabling point-to-point and surface-to-surface mappings across isometric and non-isometric shape families. Across SHREC'19/20, FAUST, and TOSCA, Diff3F demonstrates competitive accuracy with strong generalization while requiring no training, making it a practical complement to geometric descriptors for semantic shape analysis.
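
As a rough illustration of the cosine-similarity matching mentioned above, the following NumPy sketch assigns each source vertex to the target vertex whose descriptor is most similar. It is a minimal example under stated assumptions, not the authors' implementation: the function name `cosine_correspondence`, the array shapes, and the synthetic descriptors are all hypothetical.

```python
import numpy as np

def cosine_correspondence(src_feats, tgt_feats, eps=1e-8):
    """Return, for each source vertex, the index of the most similar target vertex.

    src_feats: (N_src, C) per-vertex descriptors of the source shape (assumed input).
    tgt_feats: (N_tgt, C) per-vertex descriptors of the target shape (assumed input).
    """
    # Normalize rows so that a dot product equals cosine similarity.
    src = src_feats / (np.linalg.norm(src_feats, axis=1, keepdims=True) + eps)
    tgt = tgt_feats / (np.linalg.norm(tgt_feats, axis=1, keepdims=True) + eps)
    similarity = src @ tgt.T            # (N_src, N_tgt)
    return similarity.argmax(axis=1)    # nearest target per source vertex

# Synthetic check: the source descriptors are a rescaled, shuffled copy of the target's.
rng = np.random.default_rng(0)
tgt_feats = rng.normal(size=(50, 16))
perm = rng.permutation(50)
src_feats = 3.0 * tgt_feats[perm]       # cosine similarity ignores the scale factor
matches = cosine_correspondence(src_feats, tgt_feats)
```

Because the similarity matrix is dense, memory grows with the product of the two vertex counts; for large shapes one would likely process the source vertices in chunks.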

Abstract

We present Diff3F as a simple, robust, and class-agnostic feature descriptor that can be computed for untextured input shapes (meshes or point clouds). Our method distills diffusion features from image foundational models onto input shapes. Specifically, we use the input shapes to produce depth and normal maps as guidance for conditional image synthesis. In the process, we produce (diffusion) features in 2D that we subsequently lift and aggregate on the original surface. Our key observation is that even if the conditional image generations obtained from multi-view rendering of the input shapes are inconsistent, the associated image features are robust and, hence, can be directly aggregated across views. This produces semantic features on the input shapes, without requiring additional data or training. We perform extensive experiments on multiple benchmarks (SHREC'19, SHREC'20, FAUST, and TOSCA) and demonstrate that our features, being semantic instead of geometric, produce reliable correspondence across both isometric and non-isometrically related shape families. Code is available via the project page at https://diff3f.github.io/
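
The "lift and aggregate" step can be pictured with a small NumPy sketch. It assumes that per-view feature maps, the pixel each vertex projects to, and a per-view visibility mask have already been computed (all hypothetical inputs), and it uses a plain average over the views in which a vertex is visible. The averaging rule is an assumption made for illustration; the abstract only states that features are aggregated across views.

```python
import numpy as np

def aggregate_view_features(feature_maps, pixel_coords, visible):
    """Average 2D features onto vertices over all views where each vertex is visible.

    feature_maps: (V, H, W, C) per-view image features (assumed precomputed).
    pixel_coords: (V, N, 2) integer (row, col) of each vertex in each view.
    visible:      (V, N) boolean mask, True where the vertex is seen in that view.
    Returns an (N, C) array; vertices never seen keep an all-zero descriptor.
    """
    n_views, n_verts = visible.shape
    n_channels = feature_maps.shape[-1]
    accum = np.zeros((n_verts, n_channels))
    counts = np.zeros(n_verts)
    for v in range(n_views):
        idx = np.flatnonzero(visible[v])                 # vertices seen in view v
        rows, cols = pixel_coords[v, idx, 0], pixel_coords[v, idx, 1]
        accum[idx] += feature_maps[v, rows, cols]        # gather features at projections
        counts[idx] += 1
    seen = counts > 0
    accum[seen] /= counts[seen, None]                    # mean over contributing views
    return accum

# Tiny example: two 2x2 views with one channel, three vertices.
feature_maps = np.stack([np.full((2, 2, 1), 1.0), np.full((2, 2, 1), 3.0)])
pixel_coords = np.zeros((2, 3, 2), dtype=int)            # every vertex projects to pixel (0, 0)
visible = np.array([[True, False, False],
                    [True, True, False]])
vertex_feats = aggregate_view_features(feature_maps, pixel_coords, visible)
```

In this toy setup the first vertex averages both views, the second takes its value from the only view that sees it, and the third is never observed and stays at zero.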

Paper Structure (29 sections, 13 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Correspondence in-the-wild. We introduce Diff3F, a novel feature distiller that harnesses the expressive power of in-painting diffusion features and distills them to points on 3D surfaces. Here, the proposed features are employed for point-to-point shape correspondence between assets varying in shape, pose, species, and topology. We achieve this without any fine-tuning of the underlying diffusion models, and demonstrate results on untextured meshes, point clouds, and raw scans. The leftmost mesh is the source, and all the remaining 3D shapes are targets. Note that we show raw point-to-point correspondence, without any regularization or smoothing. Inputs are point clouds, non-manifold meshes, or 2-manifold meshes. Corresponding points are similarly colored across the shapes.
  • Figure 2: Method overview. Diff3F is a feature distiller that maps semantic diffusion features to 3D surface points. We render the given shape without textures from multiple views, and the resulting renderings are in-painted by guiding ControlNet with geometric conditions; the generative features from ControlNet are fused with features obtained from the textured rendering and then unprojected to the 3D surface. Note that the textured images obtained by conditioning ControlNet from different views can be inconsistent.
  • Figure 3: Results gallery. Diff3F's performance on various point correspondence challenges. Corresponding points are similarly colored. Note that Diff3F can successfully distinguish between symmetric parts and remains fairly robust under pose and shape variations. For each shape pair, the source is on the left and the target is on the right.
  • Figure 4: Comparisons. We compare our Diff3F (bottom) against SOTA methods (i.e., DPC and SE-ORNet) for the task of point-to-point shape correspondence. Corresponding points, computed as described in the correspondence section, are similarly colored. We show results using point cloud rendering of our method for the human pair (left) and results with mesh rendering for the animal pair (right). The method-comparison table shows quantitative evaluation on benchmarks.
  • Figure 5: Regularizing point-to-point maps. We compare the effectiveness of vanilla functional maps with the Wave Kernel Signature as descriptors (top) vs. our Diff3F descriptors (bottom). Because our descriptors are semantic, Functional Maps can handle non-isometric deformations, even though FMs typically struggle with such cases when using traditional geometric descriptors. Our descriptors yield accurate correspondence in most cases, thus eliminating the need for the further refinement algorithms typically used in related works.
  • ...and 1 more figures
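
To make the role of descriptors in a functional map (Figure 5) more concrete, here is a generic least-squares sketch in NumPy. It is not the paper's pipeline: the bases are arbitrary orthonormal matrices standing in for a spectral basis such as Laplace-Beltrami eigenfunctions, no regularization is applied, and the function name `fit_functional_map` and all inputs are hypothetical.

```python
import numpy as np

def fit_functional_map(basis_src, basis_tgt, desc_src, desc_tgt):
    """Least-squares functional map C with C @ coeff_src ≈ coeff_tgt.

    basis_src: (N_src, k) basis functions on the source shape (assumed given).
    basis_tgt: (N_tgt, k) basis functions on the target shape (assumed given).
    desc_src:  (N_src, d) per-vertex descriptors on the source shape.
    desc_tgt:  (N_tgt, d) per-vertex descriptors on the target shape.
    """
    # Express the descriptors in each shape's basis.
    coeff_src = np.linalg.pinv(basis_src) @ desc_src     # (k, d)
    coeff_tgt = np.linalg.pinv(basis_tgt) @ desc_tgt     # (k, d)
    # Solve C @ coeff_src ≈ coeff_tgt, i.e. coeff_src.T @ C.T ≈ coeff_tgt.T.
    c_transposed, *_ = np.linalg.lstsq(coeff_src.T, coeff_tgt.T, rcond=None)
    return c_transposed.T                                # (k, k)

# Synthetic check: the target is a vertex permutation of the source,
# so the map between the (permuted) bases should be the identity.
rng = np.random.default_rng(0)
n_verts, k, d = 40, 5, 12
basis_src, _ = np.linalg.qr(rng.normal(size=(n_verts, k)))  # orthonormal columns
desc_src = rng.normal(size=(n_verts, d))
perm = rng.permutation(n_verts)
basis_tgt, desc_tgt = basis_src[perm], desc_src[perm]
C = fit_functional_map(basis_src, basis_tgt, desc_src, desc_tgt)
```

The estimate is only well determined when the descriptors span the basis, which is one reason descriptor quality matters for this kind of map.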