Fast-SAM3D: 3Dfy Anything in Images but Faster

Weilun Feng, Mingqiang Wu, Zhiliang Chen, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiaokun Liu, Guoxin Fan, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu

TL;DR

This work addresses the prohibitive latency of SAM3D in single-view open-world 3D reconstruction. It diagnoses inference dynamics and multi-level heterogeneity across shape, layout, texture, and mesh decoding, then proposes a training-free framework, Fast-SAM3D, with three heterogeneity-aware modules: Modality-Aware Step Caching, Joint Spatiotemporal Token Carving, and Spectral-Aware Token Aggregation. Together, these components adapt computation to instantaneous generation complexity, yielding up to 2.67x end-to-end speedups with minimal fidelity loss and establishing a new Pareto frontier for efficient 3D generation. The approach is demonstrated across diverse objects and scenes, highlighting robust geometry, preserved semantics, and practical feasibility for interactive 3D content creation.

Abstract

SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to 2.67x end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is available at https://github.com/wlfeng0509/Fast-SAM3D.
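
As a rough illustration of the step-caching idea in mechanism (1), the sketch below skips the model on some sampling steps and treats two token groups differently on those steps. It is a toy under stated assumptions, not the Fast-SAM3D implementation: the function name `sample_with_step_cache`, the `skip_every` schedule, the Euler update, the linear extrapolation of shape outputs, and the plain reuse of the last layout output are all hypothetical choices made for this example.

```python
from typing import Callable, List, Tuple

Vec = List[float]
# Hypothetical model signature (an assumption for this sketch):
# (shape_tokens, layout_tokens, step) -> (shape_velocity, layout_velocity)
Model = Callable[[Vec, Vec, int], Tuple[Vec, Vec]]


def sample_with_step_cache(
    model: Model,
    shape: Vec,
    layout: Vec,
    num_steps: int,
    skip_every: int = 2,
) -> Tuple[Vec, Vec, int]:
    """Euler sampling that skips the model call on selected steps.

    Returns the final shape tokens, the final layout tokens, and the
    number of model calls actually made.
    """
    if skip_every < 1:
        raise ValueError("skip_every must be >= 1")
    dt = 1.0 / num_steps
    history: List[Tuple[int, Vec]] = []  # last two computed (step, shape_velocity)
    last_layout_v: Vec = []
    calls = 0
    for step in range(num_steps):
        # Skip only when two computed outputs are available to extrapolate from.
        skip = len(history) == 2 and step % skip_every == skip_every - 1
        if skip:
            (s0, v0), (s1, v1) = history
            w = (step - s1) / (s1 - s0)
            # Shape: linear extrapolation from the two cached velocities.
            shape_v = [b + (b - a) * w for a, b in zip(v0, v1)]
            # Layout: reuse the latest cached velocity unchanged (assumption).
            layout_v = last_layout_v
        else:
            shape_v, layout_v = model(shape, layout, step)
            calls += 1
            history = (history + [(step, shape_v)])[-2:]
            last_layout_v = layout_v
        shape = [x + dt * v for x, v in zip(shape, shape_v)]
        layout = [x + dt * v for x, v in zip(layout, layout_v)]
    return shape, layout, calls
```

With `num_steps=10` and `skip_every=2`, this toy loop calls the model on six of the ten steps. How closely the skipped steps track the full trajectory depends entirely on how smooth the model outputs are, which is the property the abstract attributes to shape but not to layout.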

Paper Structure

This paper contains 72 sections, 22 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Fast-SAM3D accelerates the state-of-the-art single-view reconstruction model SAM3D by up to 2.67x while maintaining geometric fidelity and semantic consistency.
  • Figure 2: Overview of the proposed Fast-SAM3D framework. Our approach integrates three heterogeneity-aware modules designed to align computation with the specific dynamics of each stage: (Stage 1) Modality-Aware Step Caching disentangles the smooth evolution of shape tokens from the sensitive trajectory of layout tokens; (Stage 2) Joint Spatiotemporal Token Carving dynamically eliminates redundancy by concentrating refinement compute solely on high-entropy regions; and (Stage 3) Spectral-Aware Token Aggregation adapts the decoding grid density based on the instance-specific geometric complexity.
  • Figure 3: Pipeline characterization and bottleneck analysis. (a) The standard two-stage coarse-to-fine architecture of SAM3D. (b) Latency scaling analysis revealing the dominant computational costs: the linear scaling of iterative denoising steps in the generators and the combinatorial complexity of processing dense voxel tokens in the mesh decoder.
  • Figure 4: Illustration of modality heterogeneity. A comparison of update trajectories for shape tokens versus layout tokens. While shape tokens evolve along a smooth path amenable to extrapolation, layout tokens exhibit high-frequency volatility. Further analysis appears in the appendix.
  • Figure 5: Visualization of intrinsic refinement sparsity. (a) The real change map demonstrates that significant updates are spatially sparse, and our unified saliency map accurately predicts this pattern. (b) Temporal feature difference plots confirm that the diffusion trajectory is non-uniform, validating our dynamic reusing strategy. Further analysis appears in the appendix.
  • ...and 7 more figures
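
The sparsity described in the Figure 5 caption suggests refining only the tokens that are still changing. The sketch below illustrates that general idea with a top-k selection. It is a minimal toy, not the paper's method: the name `carve_and_refine`, the `keep_ratio` parameter, and the use of the absolute change since the previous step as a saliency proxy are assumptions made for this example, and the unified saliency map mentioned in the caption is not specified in this excerpt.

```python
import math
from typing import Callable, List


def carve_and_refine(
    tokens: List[float],
    prev_tokens: List[float],
    refine: Callable[[float], float],
    keep_ratio: float,
) -> List[float]:
    """Refine only the most salient tokens and reuse the rest unchanged."""
    # Saliency proxy (assumption): magnitude of change since the previous step.
    saliency = [abs(t - p) for t, p in zip(tokens, prev_tokens)]
    # Keep at least one token active.
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    active = set(ranked[:k])
    # Active tokens get the (expensive) refinement; the others are reused as-is.
    return [refine(t) if i in active else t for i, t in enumerate(tokens)]
```

For example, with `keep_ratio=0.5` and four tokens, only the two tokens that changed most since the previous step are passed to `refine`, so the refinement cost in this toy scales with the kept fraction rather than with the full token count.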