FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation

Changyang Li, Xueqing Huang, Shin-Fang Chng, Huangying Zhan, Qingan Yan, Yi Xu

Abstract

While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed "lift-and-cluster" paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that effectively bypasses post-hoc clustering. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, efficiently adapted to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism for view-consistent 3D instance segmentation. By projecting 3D object queries directly into multi-view feature maps, our method samples context efficiently. Furthermore, we introduce a dual-level regularization strategy that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state-of-the-art clustering-based methods.
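
To make the anchor-sampling cross-attention concrete, the sketch below shows one way the projection-and-sampling step could look in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: all class and function names are hypothetical, a simple pinhole camera model is assumed, and the per-view intrinsics and extrinsics stand in for the camera parameters the paper predicts with its frozen depth backbone.

```python
# Hypothetical sketch of anchor-sampling cross-attention (illustrative names,
# not the paper's code). Each 3D anchor is projected into every view, local
# features are bilinearly sampled there, and the samples serve as keys/values
# for the corresponding object query.
import torch
import torch.nn.functional as F
from torch import nn


def project_anchors(anchors, K, w2c, H, W):
    """Project Q anchors (Q, 3) into V views; returns sampling coords in [-1, 1]."""
    ones = torch.ones(anchors.shape[0], 1, device=anchors.device)
    pts = torch.cat([anchors, ones], dim=-1)               # (Q, 4) homogeneous
    cam = torch.einsum("vij,qj->vqi", w2c, pts)            # (V, Q, 3) camera frame
    uv = torch.einsum("vij,vqj->vqi", K, cam)              # pinhole projection
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-6)        # (V, Q, 2) pixel coords
    # Normalize to the [-1, 1] range expected by grid_sample (x = width axis).
    return torch.stack([uv[..., 0] / (W - 1),
                        uv[..., 1] / (H - 1)], dim=-1) * 2 - 1


class AnchorSamplingCrossAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, anchors, feats, K, w2c):
        # queries: (Q, C) content queries; anchors: (Q, 3) 3D anchor points
        # feats: (V, C, H, W) per-view features; K: (V, 3, 3); w2c: (V, 3, 4)
        V, C, H, W = feats.shape
        grid = project_anchors(anchors, K, w2c, H, W)      # (V, Q, 2)
        sampled = F.grid_sample(feats, grid.unsqueeze(2),
                                align_corners=True)        # (V, C, Q, 1)
        sampled = sampled.squeeze(-1).permute(2, 0, 1)     # (Q, V, C)
        # Each query attends only to its own per-view samples (batch dim = Q),
        # so cost grows with the number of queries, not with dense pixels.
        out, _ = self.attn(queries.unsqueeze(1), sampled, sampled)
        return queries + out.squeeze(1)
```

Because each query gathers a constant number of samples per view rather than attending over all pixels, the attention cost in this sketch grows linearly with the number of views, which mirrors the scalability argument the abstract makes against dense lift-and-cluster pipelines.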

Figures (4)

  • Figure 1: Overview of the FAST3DIS framework. Given unposed multi-view RGB images: (1) Dual-Pass Backbone: A frozen Depth Anything V3 (DA3) [depthanything3] extracts dense depth and camera parameters, while a LoRA [hu2022lora]-adapted pathway extracts multi-scale view features $\mathcal{F}$. (2) Geometry-Injected Pixel Decoder: The predicted geometry is injected to generate a 3D-aware feature pyramid $\mathcal{F}_{\text{dec}}$ and mask features $F_{\text{mask}}$. (3) 3D Anchor Generator: Explicit 3D anchors $A$ and content queries $C_q$ are produced conditioned on the global scene context $\gamma$. (4) Anchor-Sampling Transformer Decoder: 3D anchors are projected onto the 2D views to sample local multi-view features from $\mathcal{F}_{\text{dec}}$. Finally, the predicted geometry and masks $\hat{M}$ are assembled into 3D instances.
  • Figure 2: Comparisons with IGGT [li2025iggt]. Left: Inference time comparison (log scale). Right: Qualitative segmentation results. Red circles highlight small objects that IGGT fails to isolate and incorrectly merges into the supporting background.
  • Figure 3: Qualitative comparison with IGGT. Red circles highlight adjacent objects that IGGT incorrectly merges into a single instance.
  • Figure 4: Qualitative ablation of our proposed regularization strategies. Left: Impact of explicit feature and spatial regularization. The full model successfully maintains consistent instance identities across different viewpoints (highlighted in green). Without $\mathcal{L}_{\text{geo}}$ and $\mathcal{L}_{\text{overlap}}$ (see Section \ref{sec:loss}), the model suffers from severe cross-view association failures (the table and the pillow, highlighted in red) and "query hijacking", resulting in physically overlapping masks. Right: Impact of the dynamic overlap penalty. A fixed, static penalty forces the network to conservatively predict artificial background gaps (black regions) between adjacent objects. In contrast, our dynamic scheduling ensures more precise and contiguous instance boundaries (a code sketch of this scheduling follows this list).
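
To illustrate the dynamically scheduled overlap penalty discussed in Figure 4, the following is a hedged sketch. The paper's exact penalty and schedule are not reproduced here: the soft "excess coverage" penalty and the cosine annealing below are illustrative assumptions, as are all names.

```python
# Hypothetical sketch of a dynamically scheduled overlap penalty (assumed
# formulation, not the paper's exact loss). The penalty discourages distinct
# queries from claiming the same pixels; its weight decays during training so
# late-stage boundaries are not pushed apart into artificial background gaps.
import math
import torch


def overlap_penalty(mask_logits: torch.Tensor) -> torch.Tensor:
    """mask_logits: (Q, H, W) per-query mask logits for one view."""
    probs = mask_logits.sigmoid()                    # (Q, H, W) soft masks
    total = probs.sum(dim=0)                         # summed coverage per pixel
    # Coverage beyond the single strongest query indicates a collision.
    excess = total - probs.max(dim=0).values
    return excess.mean()


def overlap_weight(step: int, total_steps: int,
                   w_start: float = 1.0, w_end: float = 0.1) -> float:
    """Cosine-annealed penalty weight: strong early to separate colliding
    queries, weak late so adjacent masks can meet at true boundaries."""
    t = min(step / max(total_steps, 1), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * t))
```

In training, the term would enter the loss as `overlap_weight(step, total_steps) * overlap_penalty(mask_logits)`: a fixed large weight reproduces the over-conservative gaps shown in Figure 4 (right), whereas the decaying schedule relaxes the constraint once queries have stopped colliding.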