Fast-SAM3D: 3Dfy Anything in Images but Faster
Weilun Feng, Mingqiang Wu, Zhiliang Chen, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiaokun Liu, Guoxin Fan, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
TL;DR
This work addresses the prohibitive latency of SAM3D in single-view open-world 3D reconstruction. It diagnoses inference dynamics and multi-level heterogeneity across shape, layout, texture, and mesh decoding, then proposes a training-free framework, Fast-SAM3D, with three heterogeneity-aware modules: Modality-Aware Step Caching, Joint Spatiotemporal Token Carving, and Spectral-Aware Token Aggregation. Together, these components adapt computation to instantaneous generation complexity, yielding up to 2.67x end-to-end speedups with minimal fidelity loss and establishing a new Pareto frontier for efficient 3D generation. The approach is demonstrated across diverse objects and scenes, highlighting robust geometry, preserved semantics, and practical feasibility for interactive 3D content creation.
Abstract
SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to 2.67× end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is available at https://github.com/wlfeng0509/Fast-SAM3D.
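To make the idea of step caching concrete, the following is a minimal, generic sketch of reusing a cached model output across steps of an iterative sampler, with a different refresh interval for each stream. It is not the Modality-Aware Step Caching rule from the paper, whose details are not given in this abstract. The function names, the toy model, the fixed refresh intervals, and the choice to refresh the "layout" stream every step while caching the "shape" stream are all assumptions made purely for illustration.

```python
from typing import Callable, Dict


def sample_with_step_cache(
    model: Callable[[str, float, int], float],
    state: Dict[str, float],
    num_steps: int,
    refresh_every: Dict[str, int],
    step_size: float = 0.1,
) -> Dict[str, int]:
    """Run a toy iterative sampler that reuses cached model outputs per stream.

    `state` is updated in place. Returns the number of real model calls
    made for each stream, so the compute saving can be inspected.
    """
    cache: Dict[str, float] = {}
    calls = {name: 0 for name in state}
    for step in range(num_steps):
        for name in state:
            # Refresh the cached output only on this stream's schedule.
            # Step 0 always refreshes, so the cache is never empty.
            if step % refresh_every[name] == 0:
                cache[name] = model(name, state[name], step)
                calls[name] += 1
            # Otherwise the most recent cached output is reused as-is.
            state[name] += step_size * cache[name]
    return calls


def toy_model(name: str, x: float, step: int) -> float:
    """Hypothetical stand-in for an expensive network: pulls x toward 0."""
    return -x


# Assumed schedule: the "layout" stream is treated as sensitive and is
# recomputed every step; the "shape" stream reuses its output for 3 steps.
state = {"shape": 1.0, "layout": 1.0}
calls = sample_with_step_cache(
    toy_model, state, num_steps=12, refresh_every={"shape": 3, "layout": 1}
)
print(calls)  # {'shape': 4, 'layout': 12}
```

In this toy run the sampler makes 16 model calls instead of 24. The trade-off is that the cached stream follows a coarser trajectory than the one recomputed every step, which is why the choice of which stream to cache matters.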
