Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation
Jiahao Lu, Jiacheng Deng, Tianzhu Zhang
TL;DR
This work addresses two weaknesses of transformer-based 3D instance segmentation: query initialization that fails to balance foreground coverage with content learning, and recall that decays across decoder layers because every layer is supervised. It introduces BFL, which combines an Agent-Interpolation Initialization Module (AI2M) with a Hierarchical Query Fusion Decoder (HQFD) to keep recall high while preserving informative query content. The approach builds on Sparse UNet features, position queries obtained by farthest point sampling (FPS), and agent-driven content interpolation; IoU-guided query retention then fuses queries across decoder layers to prevent object disappearance. On ScanNetV2, ScanNet200, ScanNet++, and S3DIS, BFL achieves state-of-the-art performance among transformer-based 3D instance segmentation methods, with more stable recall and faster convergence. The design can also be plugged into other transformer-based models, which makes it relevant to 3D scene understanding in AR/VR and robotics.
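To make the FPS-based position queries concrete, here is a minimal sketch of greedy farthest point sampling in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `farthest_point_sampling`, the `start` parameter, and the toy point cloud are all hypothetical, and a real pipeline would run a batched GPU version on the full scene.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return indices of k points that are spread far apart.

    `points` is a list of (x, y, z) tuples. The name and signature are
    illustrative assumptions, not taken from the paper.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i]: distance from point i to its nearest selected point so far
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from everything selected so far
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the newly chosen point
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

# Toy scene: two nearly coincident points and three distant ones
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0),
          (5.0, 5.0, 0.0), (0.0, 5.0, 0.0)]
query_indices = farthest_point_sampling(points, 3)
```

Because each new sample maximizes its distance to the already selected set, the chosen positions tend to cover the scene evenly, which is the property that gives FPS-initialized queries their high foreground coverage.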
Abstract
3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels. Transformer-based methods are attracting increasing attention because of their elegant pipelines, reduced manual selection of geometric properties, and superior performance. However, these methods fail to maintain strong position information and strong content information at the same time during query initialization. In addition, because supervision is applied at every decoder layer, objects can disappear as the layers deepen. To overcome these hurdles, we introduce Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation (BFL). Specifically, an Agent-Interpolation Initialization Module is designed to generate resilient queries that balance foreground coverage and content learning. A Hierarchical Query Fusion Decoder is further designed to retain low-overlap queries, mitigating the drop in recall as the layers deepen. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, and S3DIS datasets demonstrate the superior performance of BFL.
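The idea of retaining low-overlap queries can be sketched as follows. This is a simplified illustration under assumptions, not the decoder described in the paper: masks are modeled as sets of point indices, and the helper names `mask_iou` and `retain_low_overlap`, the `num_keep` parameter, and the lowest-max-IoU selection rule are all hypothetical.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def retain_low_overlap(prev_masks, curr_masks, num_keep):
    """Return indices of previous-layer queries to carry forward.

    Assumed rule (not confirmed by the source): a previous-layer query is
    worth keeping when its mask overlaps little with every current-layer
    mask, because the object it covers may have disappeared.
    """
    # For each previous mask, the best match among the current masks
    max_iou = [max((mask_iou(p, c) for c in curr_masks), default=0.0)
               for p in prev_masks]
    # Lowest best-match IoU first: these are the least-covered objects
    order = sorted(range(len(prev_masks)), key=lambda i: max_iou[i])
    return order[:num_keep]

# Toy example: the third previous-layer mask has no counterpart in the
# current layer, so it is the one retained.
prev_masks = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8}]
curr_masks = [{0, 1, 2, 3}, {4, 5, 9}]
kept = retain_low_overlap(prev_masks, curr_masks, num_keep=1)
```

In this toy case the retained query would be appended to the current layer's query set, so an instance found by an earlier layer is still available to later layers instead of being lost.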
