Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction

Yun Zhou, Yaoting Wang, Guangquan Jie, Jinyu Liu, Henghui Ding

TL;DR

Problem: enabling text-referenced 3D reconstruction from a single image when SAM3D relies on explicit masks. Approach: Ref-SAM3D adds a vision-language grounded mask proposer to generate masks from a referring expression $\mathbf{t}$, producing $\mathbf{M}_i = \mathcal{M}(\mathbf{I}, \mathbf{t})$ and corresponding 3D outputs $\mathcal{R}_i = \text{SAM3D}(\mathbf{I}, \mathbf{M}_i)$, without retraining. Contributions: a simple, modular, plug-and-play framework that achieves zero-shot, text-guided 3D reconstructions for single, multi-object, and multi-instance scenarios. Significance: bridges semantic language cues and geometry, enabling more accessible 3D editing, game design, and virtual environments using only off-the-shelf models.
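The two-stage composition above, $\mathbf{M}_i = \mathcal{M}(\mathbf{I}, \mathbf{t})$ followed by $\mathcal{R}_i = \text{SAM3D}(\mathbf{I}, \mathbf{M}_i)$, can be sketched as a thin wrapper around two frozen models. This is a minimal illustration, not the authors' implementation: `ref_sam3d`, `toy_mask_proposer`, and `toy_sam3d` are hypothetical names, the proposer is assumed to return a list with one mask per referred object, and the toy stand-ins only mimic the data flow (a label grid instead of an RGB image, a pixel count instead of a 3D asset).

```python
from typing import Any, Callable, Dict, List

# Toy stand-ins for the real data types (assumptions for illustration only).
Image = List[List[str]]   # grid of per-pixel labels instead of an RGB image
Mask = List[List[bool]]   # binary mask with the same shape as the image


def ref_sam3d(
    image: Any,
    text: str,
    mask_proposer: Callable[[Any, str], List[Any]],
    sam3d: Callable[[Any, Any], Any],
) -> List[Any]:
    """Compose the two stages: M_i = M(I, t), then R_i = SAM3D(I, M_i).

    Both callables are treated as frozen black boxes; nothing is trained here.
    """
    masks = mask_proposer(image, text)              # one mask per referred object
    return [sam3d(image, mask) for mask in masks]   # one reconstruction per mask


def toy_mask_proposer(image: Image, text: str) -> List[Mask]:
    """Hypothetical proposer: one mask per pixel label mentioned in the text."""
    labels = sorted({label for row in image for label in row})
    return [
        [[cell == label for cell in row] for row in image]
        for label in labels
        if label in text
    ]


def toy_sam3d(image: Image, mask: Mask) -> Dict[str, int]:
    """Hypothetical reconstructor: reports the mask area instead of a 3D asset."""
    return {"mask_pixels": sum(cell for row in mask for cell in row)}


if __name__ == "__main__":
    image = [["cat", "cat", "bg"], ["bg", "dog", "bg"]]
    print(ref_sam3d(image, "the cat and the dog", toy_mask_proposer, toy_sam3d))
```

Because the wrapper only passes masks from one callable to the other, either component could in principle be swapped for a different off-the-shelf model without touching the other, which is the sense in which the framework is described as plug-and-play.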

Abstract

SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.

Paper Structure

This paper contains 10 sections, 2 equations, 4 figures.

Figures (4)

  • Figure 1: Inference pipeline of Ref-SAM3D. The pipeline takes an input image and a referring expression, which are processed by the mask proposer to generate the mask of the referred object. This mask is then passed to SAM3D with the original image for 3D object reconstruction. For simplicity, the layout, voxel, mesh, and Gaussian splat decoders are omitted. The output $\{R, T, S\}$ represents the layout attributes, including rotation, translation, and scaling.
  • Figure 2: Case A: Referring to and reconstructing a single object in straightforward scenarios.
  • Figure 3: Case B: Referring to and reconstructing multiple objects.
  • Figure 4: Case C: Referring to and reconstructing a single object across multiple instances of the same semantic class.
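The layout attributes $\{R, T, S\}$ named in the Figure 1 caption can be read as a similarity transform that places a reconstructed object in the scene. The sketch below shows one common way such a transform is applied to object-space points. The composition order (scale, then rotate, then translate), the uniform scalar scale, and the 3x3 matrix form of the rotation are all assumptions made for illustration; the source does not specify SAM3D's conventions, and `apply_layout` and `rotation_z` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def rotation_z(theta: float) -> List[List[float]]:
    """3x3 rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_layout(
    points: Sequence[Sequence[float]],
    R: Sequence[Sequence[float]],
    T: Sequence[float],
    S: float,
) -> List[List[float]]:
    """Map object-space points to scene space as p' = R (S p) + T.

    Assumed convention: uniform scale first, then rotation, then translation.
    """
    placed = []
    for p in points:
        scaled = [S * x for x in p]
        rotated = [sum(R[r][c] * scaled[c] for c in range(3)) for r in range(3)]
        placed.append([rotated[r] + T[r] for r in range(3)])
    return placed


if __name__ == "__main__":
    # Scale by 2, rotate 90 degrees about z, then lift by 1 along z.
    print(apply_layout([[1.0, 0.0, 0.0]], rotation_z(math.pi / 2), [0.0, 0.0, 1.0], 2.0))
```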