Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation

Jiahao Lu, Jiacheng Deng, Tianzhu Zhang

TL;DR

This work addresses two problems in transformer-based 3D instance segmentation: query initialization that fails to balance foreground coverage with content learning, and recall that decays from one decoder layer to the next because every layer is supervised. It introduces BFL, which combines an Agent-Interpolation Initialization Module (AI2M) with a Hierarchical Query Fusion Decoder (HQFD) to keep recall high while preserving informative content in the queries. The approach builds on Sparse UNet features, takes position queries from farthest point sampling (FPS), and obtains content queries by agent-driven interpolation. Across decoder layers, IoU-guided query retention carries low-overlap queries forward so that objects found early do not disappear. On ScanNetV2, ScanNet200, ScanNet++, and S3DIS, BFL achieves state-of-the-art performance among transformer-based 3D instance segmentation methods, with more stable recall and faster convergence. The design can also be plugged into other transformer-based models, which is relevant to 3D scene understanding in AR/VR and robotics applications.
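As a rough illustration of the FPS-based position queries mentioned above, the following is a minimal sketch of greedy farthest point sampling in plain Python. It is not the authors' implementation: the function name `farthest_point_sampling`, the `start` parameter, and the toy coordinates are assumptions made for this example, and a real pipeline would run on GPU tensors over far more points.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return k indices, each new one farthest from those already chosen.

    Hypothetical helper for illustration only (not from the paper).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] holds the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Pick the point that is currently farthest from the chosen set.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        # Refresh nearest-chosen distances with the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


# Toy scene: two nearly coincident points near the origin plus three spread-out points.
points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (5.0, 0.0, 0.0),
    (5.0, 5.0, 0.0),
    (0.0, 5.0, 0.0),
]
position_queries = farthest_point_sampling(points, k=3)
```

In this toy scene the sampler skips the near-duplicate point at index 1, which is the spatial-coverage property that makes FPS attractive for position queries. FPS on its own carries no learned content, which is the gap the summary says the agent-driven interpolation is meant to fill.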

Abstract

3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines, reduced manual selection of geometric properties, and superior performance. However, transformer-based methods fail to simultaneously maintain strong position and content information during query initialization. Additionally, due to supervision at each decoder layer, there exists a phenomenon of object disappearance with the deepening of layers. To overcome these hurdles, we introduce Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation (BFL). Specifically, an Agent-Interpolation Initialization Module is designed to generate resilient queries capable of achieving a balance between foreground coverage and content learning. Additionally, a Hierarchical Query Fusion Decoder is designed to retain low overlap queries, mitigating the decrease in recall with the deepening of layers. Extensive experiments on ScanNetV2, ScanNet200, ScanNet++ and S3DIS datasets demonstrate the superior performance of BFL.
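The idea of retaining low-overlap queries can be sketched as follows. The abstract does not state the exact selection rule, so this example makes an assumption: each previous-layer query is scored by its maximum mask IoU against the current layer's masks, and the lowest-scoring ones are kept. The names `mask_iou`, `fuse_low_overlap`, and `num_keep`, and the representation of masks as sets of point indices, are hypothetical choices for this illustration.

```python
def mask_iou(a, b):
    """IoU of two instance masks represented as sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse_low_overlap(prev_masks, curr_masks, num_keep):
    """Return indices of the num_keep previous-layer queries least covered by the current layer.

    Assumed rule for illustration: score each previous mask by its best IoU
    with any current mask, then keep the lowest-scoring ones.
    """
    max_iou = [
        max((mask_iou(p, c) for c in curr_masks), default=0.0)
        for p in prev_masks
    ]
    order = sorted(range(len(prev_masks)), key=lambda i: max_iou[i])
    return order[:num_keep]


# Toy example: the object covering points 7-9 was found by the previous layer
# but is missing from the current layer's predictions.
prev_masks = [{0, 1, 2}, {3, 4}, {7, 8, 9}]
curr_masks = [{0, 1, 2, 3}, {3, 4, 5}]
kept = fuse_low_overlap(prev_masks, curr_masks, num_keep=1)
```

Here the query at index 2 is retained because no current-layer mask overlaps it, so the object it describes would otherwise disappear from the deeper layer's predictions. The retained queries would then be concatenated with the current layer's queries before the next decoder layer, which is the fusion step under these assumptions.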

Paper Structure

This paper contains 25 sections, 8 equations, 18 figures, 13 tables.

Figures (18)

  • Figure 1: The phenomenon of Object Disappearance with the deepening of layers.
  • Figure 2: (a) The comparison of different query initialization methods. The FPS-based methods conduct farthest point sampling separately for each scene, placing more emphasis on positional information but lacking in aggregating content information. The learnable-based methods initialize a fixed number of queries for aggregating content information across all scenes, which is prone to empty sampling, thereby compromising foreground coverage. Our method leverages the advantages of both approaches to achieve a balanced and comprehensive solution. (b) The recall difference. The recall of the baseline shows instability during the iterative optimization process across layers, whereas our method, with the assistance of the Hierarchical Query Fusion Decoder, demonstrates a steady improvement in recall across each layer.
  • Figure 3: The overall framework of our method BFL. The Agent-Interpolation Initialization Module combines the strengths of FPS and learnable queries, producing object queries better suited to complex and dynamic environments. The Hierarchical Query Fusion Decoder retains low-overlap queries, which helps maintain recall.
  • Figure 4: Comparison on the ScanNet200 validation set. ScanNet200 uses the same point cloud data as ScanNetV2 but offers greater annotation diversity, with 198 instance classes.
  • Figure 5: Effectiveness of the Agent-Interpolation Initialization Module. We evaluate the performance of the first-layer predictions on the ScanNetV2 validation set.
  • ...and 13 more figures