LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation

Lei Yao; Yi Wang; Yawen Cui; Moyun Liu; Lap-Pui Chau

TL;DR

LaSSM addresses two bottlenecks in query-based 3D instance segmentation from point clouds: how to initialize a high-quality query set and how to refine it efficiently. It introduces a hierarchical semantic-spatial query initializer that derives query contents and coordinates from superpoints by jointly considering semantic cues and spatial distribution. It pairs this with a coordinate-guided state space model (SSM) decoder, which uses a local aggregation module and a spatial dual-path SSM block to refine queries with positional awareness. Ablations indicate that the initializer improves scene coverage and convergence, while the decoder refines queries accurately at reduced computational cost. The method ranks first on ScanNet++ V2 while using only about one-third of the FLOPs of the previous best method, and it is competitive on several other indoor benchmarks, which makes it practical for large-scale 3D scene understanding.
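To make the idea of combining semantic cues with spatial distribution concrete, here is a minimal NumPy sketch of one plausible two-stage selection: keep the superpoints with the highest foreground score, then spread queries over them with farthest point sampling (FPS). This is an illustration under stated assumptions, not the authors' implementation. The names `init_queries`, `fg_score`, `num_candidates`, and `num_queries` are hypothetical, and the actual hierarchical strategy in LaSSM may differ.

```python
import numpy as np

def farthest_point_sampling(coords, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    k = min(k, coords.shape[0])
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest chosen point so far.
    min_dist = np.linalg.norm(coords - coords[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(min_dist))
        d = np.linalg.norm(coords - coords[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

def init_queries(feats, coords, fg_score, num_candidates, num_queries):
    """Hypothetical two-stage initializer: semantic filter, then spatial FPS."""
    # Stage 1 (semantic): keep the superpoints most likely to be foreground.
    num_candidates = min(num_candidates, fg_score.shape[0])
    cand = np.argsort(-fg_score)[:num_candidates]
    # Stage 2 (spatial): spread the queries over the candidates with FPS.
    picked = cand[farthest_point_sampling(coords[cand], num_queries)]
    # Query contents and query coordinates are copied from the chosen superpoints.
    return feats[picked], coords[picked], picked

# Toy usage with random superpoints (all values are synthetic).
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 5.0, size=(200, 3))
feats = rng.normal(size=(200, 32))
fg_score = rng.uniform(size=200)
q, q_c, idx = init_queries(feats, coords, fg_score, num_candidates=100, num_queries=16)
```

In this sketch the semantic stage discards likely background before the spatial stage runs, so FPS spends its budget on covering foreground regions rather than walls and floors.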

Abstract

Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer that derives the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The decoder features a local aggregation scheme that restricts the model to geometrically coherent regions and a spatial dual-path SSM block that captures dependencies within the query set by integrating the associated coordinate information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only one-third of the FLOPs, demonstrating its strength in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on the ScanNet, ScanNet200, S3DIS and ScanNet++ V1 benchmarks at lower computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at https://github.com/RayYoh/LaSSM.
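The local aggregation idea, restricting each query to a geometrically coherent region, can be sketched as attention computed only over the k superpoints nearest to each query coordinate. The following NumPy example is a rough sketch under assumptions and not the decoder from the paper: the function name `local_aggregate`, the neighborhood size `k`, the residual update, and the weighted-centroid coordinate update are all hypothetical choices made for illustration.

```python
import numpy as np

def local_aggregate(q, q_c, feats, coords, k=8):
    """Update each query from its k nearest superpoints only (hypothetical)."""
    k = min(k, coords.shape[0])
    # Dense query-to-superpoint distances, used here for brevity;
    # a spatial index such as a KD-tree would avoid the full matrix.
    dist = np.linalg.norm(q_c[:, None, :] - coords[None, :, :], axis=-1)  # (Q, S)
    nbr = np.argsort(dist, axis=1)[:, :k]                                 # (Q, k)
    nbr_feats = feats[nbr]                                                # (Q, k, D)
    # Scaled dot-product scores over the local neighborhood only.
    scores = np.einsum("qd,qkd->qk", q, nbr_feats) / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    # Residual update of query contents from the neighbors.
    new_q = q + np.einsum("qk,qkd->qd", w, nbr_feats)
    # Assumed coordinate update: attention-weighted centroid of the neighbors.
    new_q_c = np.einsum("qk,qkc->qc", w, coords[nbr])
    return new_q, new_q_c, nbr

# Toy usage with synthetic superpoints; the first four serve as initial queries.
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 5.0, size=(50, 3))
feats = rng.normal(size=(50, 16))
new_q, new_q_c, nbr = local_aggregate(feats[:4], coords[:4], feats, coords, k=8)
```

With Q queries, S superpoints, and a fixed k, the attention scores here have shape (Q, k) instead of (Q, S), which is the kind of saving a local scheme aims for when S is large.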

Paper Structure (17 sections, 15 equations, 9 figures, 12 tables, 1 algorithm)

This paper contains 17 sections, 15 equations, 9 figures, 12 tables, 1 algorithm.

Introduction
Related Work
Query-based 3D Instance Segmentation
Efficient Query Decoders
State Space Models in Computer Vision
Methodology
Preliminaries
Overall Architecture
Hierarchical Semantic-spatial Query Initializer
Coordinate-guided SSM Query Decoder
Training and Inference
Experiments
Comparison with State-of-the-art Methods
Analysis and Ablation Study
Qualitative Results
...and 2 more sections

Figures (9)

Figure 1: Query distribution and performance comparison. (a) Query distributions of farthest point sampling (FPS) [schult2023mask3d], semantic confidence-based selection (Semantic) [he2023fastinst], and our method on different scenes. (b) Compared to SPFormer [sun2023spformer], OneFormer3D [kolodiazhnyi2023of3d] and SGIFormer [yao2024SGIFormer], LaSSM achieves a favorable balance between performance and GPU efficiency.
Figure 2: Architecture of LaSSM. The input point cloud is processed by the feature extractor to obtain superpoint features $\mathbf{F}_s$ and coordinates $\mathbf{C}_s$. The hierarchical semantic-spatial initializer then initializes query contents $\mathbf{Q}$ and coordinates $\mathbf{Q}_c$ (see the section Hierarchical Semantic-spatial Query Initializer). The resulting query set is further refined by the coordinate-guided SSM query decoder, which iteratively updates query contents and coordinates to predict instances (see the section Coordinate-guided SSM Query Decoder). FPS denotes farthest point sampling.
Figure 3: Detailed architecture of spatial dual-path SSM block.
Figure 4: Convergence speed. LaSSM converges faster and performs better than the farthest point sampling (FPS) [schult2023mask3d] and semantic-guided [he2023fastinst] query selection methods.
Figure 5: Visualization of query distributions. Red and blue boxes highlight the comparison with FPS and Semantic, respectively.
...and 4 more figures
