SAM 3D: 3Dfy Anything in Images

SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik

TL;DR

SAM 3D introduces a foundation-style model for visually grounded 3D reconstruction from a single image, predicting per-object geometry, texture, and layout. It combines a two-stage architecture (a Geometry model followed by a Texture & Refinement model) with a multi-stage training pipeline: synthetic pretraining, semi-synthetic mid-training, and real-world post-training driven by a data engine that pairs human annotators and professional artists with model-in-the-loop (MITL) candidate selection. The approach achieves state-of-the-art performance on 3D shape, texture, and scene layout, validated on the new in-the-wild SA-3DAO benchmark and on real-world datasets, with a head-to-head win rate of at least 5:1 in human preference tests and strong generalization under occlusion. By releasing code, weights, a demo, and SA-3DAO, the work offers a scalable path toward robust real-world 3D perception for applications such as robotics and AR/VR, addressing the long-standing data barrier in 3D learning.

Abstract

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

Paper Structure

This paper contains 105 sections, 6 equations, 22 figures, 12 tables.

Figures (22)

  • Figure 1: SAM 3D converts a single image into a composable 3D scene made of individual objects. Our method predicts per-object geometry, texture, and layout, enabling full scene reconstruction. Bottom: high-quality 3D assets recovered for each object.
  • Figure 2: SAM 3D architecture. (top) SAM 3D first predicts coarse shape and layout with the Geometry model; (right) the mixture-of-transformers architecture applies a two-stream approach with information sharing in the multi-modal self-attention layer. (bottom) The voxels predicted by the Geometry model are passed to the Texture & Refinement model, which adds higher-resolution detail and textures.
  • Figure 3: SAM 3D data, with a green outline around the target object, and the ground truth mesh shown in the bottom right. Samples are divided into four rows, based on type. Art-3DO meshes are untextured, while the rest may be textured or not, depending on the underlying asset (Iso-3DO, RP-3DO) or if the mesh was annotated for texture (MITL-3DO).
  • Figure 4: SAM 3D training paradigm. We employ a multi-stage pipeline incrementally exposing the model to increasingly complex data and modalities.
  • Figure 5: Life of an example going through the data collection pipeline. We streamline annotation by breaking it into subtasks: annotators first choose target objects (Stage 1); rank and select 3D model candidates (Stage 2); then pose these models within a 2.5D scene (Stage 3). Stages 2 and 3 use model-in-the-loop.
  • ...and 17 more figures