Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction
Yun Zhou, Yaoting Wang, Guangquan Jie, Jinyu Liu, Henghui Ding
TL;DR
Problem: SAM3D relies on explicit masks, so it cannot reconstruct an object from a single image when that object is specified only by text. Approach: Ref-SAM3D adds a vision-language grounded mask proposer $\mathcal{M}$ that takes the image $\mathbf{I}$ and a referring expression $\mathbf{t}$ and produces masks $\mathbf{M}_i = \mathcal{M}(\mathbf{I}, \mathbf{t})$; each mask is then passed to SAM3D to obtain the corresponding 3D output $\mathcal{R}_i = \text{SAM3D}(\mathbf{I}, \mathbf{M}_i)$, without retraining. Contributions: a simple, modular, plug-and-play framework that achieves zero-shot, text-guided 3D reconstruction in single-object, multi-object, and multi-instance scenarios. Significance: bridges semantic language cues and geometry, making 3D editing, game design, and virtual-environment creation more accessible using only off-the-shelf models.
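The two-stage composition above can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use the SAM3D API: `ref_reconstruct`, `toy_proposer`, `toy_reconstructor`, `Reconstruction`, and the label vocabulary are all hypothetical names introduced here, and the "image" is a toy grid of integer labels standing in for an RGB input. The sketch only shows how a text-conditioned mask proposer and a mask-conditioned reconstructor can be chained without modifying either.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]   # toy label grid (assumption: stand-in for an RGB image I)
Mask = List[List[bool]]   # binary mask M_i with the same height and width as the image


@dataclass
class Reconstruction:
    """Placeholder for one 3D output R_i; a real system would hold geometry and texture."""
    mask_area: int


# Hypothetical interfaces for the two frozen components.
MaskProposer = Callable[[Image, str], List[Mask]]      # plays the role of M(I, t)
Reconstructor = Callable[[Image, Mask], Reconstruction]  # plays the role of SAM3D(I, M_i)


def ref_reconstruct(image: Image, text: str,
                    propose_masks: MaskProposer,
                    reconstruct: Reconstructor) -> List[Reconstruction]:
    """Chain the stages: masks from (image, text), then one reconstruction per mask."""
    masks = propose_masks(image, text)
    return [reconstruct(image, mask) for mask in masks]


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
LABELS = {"cup": 1, "chair": 2}


def toy_proposer(image: Image, text: str) -> List[Mask]:
    """Return one mask per vocabulary word mentioned in the text."""
    masks = []
    for name, label in LABELS.items():
        if name in text.lower():
            masks.append([[px == label for px in row] for row in image])
    return masks


def toy_reconstructor(image: Image, mask: Mask) -> Reconstruction:
    """Record only the mask area; a real reconstructor would produce 3D geometry."""
    return Reconstruction(mask_area=sum(sum(row) for row in mask))


image = [[0, 1, 1],
         [0, 2, 2],
         [0, 0, 2]]
results = ref_reconstruct(image, "the cup and the chair", toy_proposer, toy_reconstructor)
print([r.mask_area for r in results])
```

Because the two stages communicate only through masks, either stand-in could in principle be swapped for a real model with the same call shape, which is the sense in which the framework is described as plug-and-play.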
Abstract
SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.
