SAMa: Material-aware 3D Selection and Segmentation

Michael Fischer, Iliyan Georgiev, Thibault Groueix, Vladimir G. Kim, Tobias Ritschel, Valentin Deschaintre

TL;DR

SAMa addresses material selection on 3D objects by fine-tuning a video selection model (SAM2) on a material-focused dataset so that it produces multiview-consistent 2D material predictions. It then lifts these 2D similarities into a lightweight 3D similarity point cloud via depth back-projection and queries that cloud with nearest-neighbor lookups, enabling interactive, cross-view material selection across arbitrary 3D representations (NeRFs, 3D Gaussians, meshes) without per-asset optimization. The approach improves selection accuracy and multiview consistency over strong baselines, while offering fast per-view visualization and broad applicability to segmentation and editing tasks. This enables material-aware editing, replacement, and segmentation in 3D synthesis pipelines, including X-to-3D workflows and downstream material manipulation across multiple representations.

Abstract

Decomposing 3D assets into material parts is a common task for artists and creators, yet remains a highly manual process. In this work, we introduce Select Any Material (SAMa), a material selection approach for various 3D representations. Building on the recently introduced SAM2 video selection model, we extend its capabilities to the material domain. We leverage the model's cross-view consistency to create a 3D-consistent intermediate material-similarity representation in the form of a point cloud from a sparse set of views. Nearest-neighbour lookups in this similarity cloud allow us to efficiently reconstruct accurate continuous selection masks over objects' surfaces that can be inspected from any view. Our method is multiview-consistent by design, alleviating the need for contrastive learning or feature-field pre-processing, and performs optimization-free selection in seconds. Our approach works on arbitrary 3D representations and outperforms several strong baselines in terms of selection accuracy and multiview consistency. It enables several compelling applications, such as replacing the diffuse-textured materials on a text-to-3D output, or selecting and editing materials on NeRFs and 3D-Gaussians.
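
The lifting step described above (back-projecting per-pixel similarities into a point cloud, then answering queries with nearest-neighbour lookups) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the pinhole camera model, the `Pinhole`, `back_project`, and `query_similarity` names, the brute-force search, and the plain mean over the k nearest points are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Pinhole:
    """Hypothetical pinhole intrinsics (focal lengths and principal point, in pixels)."""
    fx: float
    fy: float
    cx: float
    cy: float


def back_project(depth, similarity, cam, cam_to_world):
    """Lift every valid pixel to a world-space point that carries its similarity.

    depth, similarity: 2D lists of equal shape; depth <= 0 marks background.
    cam_to_world: 3x4 row-major matrix [R | t].
    Returns a list of ((x, y, z), similarity) entries.
    """
    cloud = []
    for v, (d_row, s_row) in enumerate(zip(depth, similarity)):
        for u, (d, s) in enumerate(zip(d_row, s_row)):
            if d <= 0.0:
                continue  # skip background pixels
            # Pixel (u, v) at depth d, expressed in camera space.
            pc = ((u - cam.cx) * d / cam.fx, (v - cam.cy) * d / cam.fy, d)
            # Rigid transform into world space.
            pw = tuple(sum(r[i] * pc[i] for i in range(3)) + r[3] for r in cam_to_world)
            cloud.append((pw, s))
    return cloud


def query_similarity(cloud, point, k=3):
    """Mean similarity of the k cloud points nearest to `point` (brute force)."""
    if not cloud:
        raise ValueError("similarity cloud is empty")
    nearest = sorted(cloud, key=lambda entry: math.dist(entry[0], point))[:k]
    return sum(s for _, s in nearest) / len(nearest)


# Toy usage: one 2x2 view with an identity camera pose.
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
cam = Pinhole(fx=1.0, fy=1.0, cx=0.0, cy=0.0)
depth = [[1.0, 0.0], [2.0, 2.0]]
similarity = [[0.9, 0.5], [0.1, 0.2]]
cloud = back_project(depth, similarity, cam, identity)
```

In practice, the clouds from all sampled views would be concatenated and the brute-force search replaced by a spatial index such as a k-d tree; the sketch keeps a single view and a linear scan for readability.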

Paper Structure

This paper contains 25 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of our method. Starting from a 3D asset and a user click, we sample cameras and create a set of renderings covering the object, which we subsequently process with our similarity network SAMa to compute dense per-pixel similarity values. We then back-project these values to 3D and store them in a point cloud that can be efficiently queried and interpolated for novel views.
  • Figure 2: Schematic overview of our fine-tuned model. The image encoder (in blue) is frozen, all other blocks (in red) are fine-tuned. Given an input image and a clicked pixel, the model outputs a material similarity map. Figure adapted from Ravi et al. (2024), SAM 2.
  • Figure 3: Effects of fine-tuning on images vs. videos. Top row shows the clicked frame. Bottom row shows an unclicked frame for which the similarity map is inferred from the model's memory.
  • Figure 4: Effects of duplicating the clicked frame in the sequence. Similarity after frame duplication is significantly cleaner, as the model is forced to use the memory module.
  • Figure 5: kNN 3D voting significantly reduces noise and improves selection quality, as seen from the insets.
  • ...and 4 more figures
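
The kNN 3D voting mentioned in the Figure 5 caption can be illustrated with a small sketch. The exact voting rule is not given in this summary, so the version below is only one plausible reading: each point's similarity is replaced by the mean over its k nearest neighbours in 3D (itself included). The `knn_vote` name, the brute-force neighbour search, and the use of a mean are assumptions for the example.

```python
import math


def knn_vote(points, values, k=5):
    """Smooth per-point similarities by averaging over each point's k nearest
    neighbours in 3D (the point itself counts as its own nearest neighbour)."""
    smoothed = []
    for p in points:
        # Indices of the k points closest to p (brute force, for clarity).
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))[:k]
        smoothed.append(sum(values[j] for j in order) / len(order))
    return smoothed


# Toy usage: four nearby points, one of which carries a noisy similarity of 0.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.0)]
values = [1.0, 1.0, 0.0, 1.0]
voted = knn_vote(points, values, k=4)
selected = [v > 0.5 for v in voted]
```

After voting, the isolated low value is pulled towards its neighbours, so thresholding the smoothed similarities selects the whole patch rather than leaving a hole.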