CompetitorFormer: Competitor Transformer for 3D Instance Segmentation

Duanchu Wang, Jing Liu, Haoran Gong, Yinghui Quan, Di Wang

TL;DR

This work tackles inter-query competition in transformer-based 3D instance segmentation, where many queries are allocated per scene and multiple queries can chase the same instance. It proposes CompetitorFormer, a set of plug-and-play designs—Query Competition Layer (QCL), Relative Relationship Encoding (RRE), and Rank Cross Attention (RCA)—to create spatial, competitive, and semantic cues that promote a dominant query and suppress competitors. By integrating these designs with state-of-the-art baselines (e.g., SPFormer, MAFT, OneFormer3D) and using a Sparse 3D U-Net backbone with flexible pooling, the approach yields consistent improvements across ScanNetv2, ScanNet200, S3DIS, and STPLS3D, including state-of-the-art results on several metrics. Ablation studies confirm the individual and combined value of QCL, RRE, and RCA, while qualitative analyses illustrate a faster emergence of the primary predictions and reduced competition among queries. Limitations include compatibility constraints with frameworks that assume a fixed number of queries, suggesting avenues to extend the method to additional 3D tasks such as object detection and panoptic segmentation.

Abstract

Transformer-based methods have become the dominant approach for 3D instance segmentation. These methods predict instance masks via instance queries, ranking them by classification confidence and IoU scores to select the top prediction as the final outcome. However, current models employ a fixed number of queries that exceeds the number of instances present in a scene. In such cases, multiple queries predict the same instance, yet only a single query is ultimately optimized. The close scores of queries in the lower-level decoders make it difficult for the dominant query to distinguish itself quickly, which impairs the model's accuracy and convergence efficiency. This phenomenon is referred to as inter-query competition. To address this challenge, we propose a series of plug-and-play, competition-oriented designs, collectively called CompetitorFormer, that reduce competition and promote a dominant query. Experiments show that integrating our designs with state-of-the-art frameworks consistently yields significant performance improvements in 3D instance segmentation across a range of datasets.

Paper Structure

This paper contains 31 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (a) The visualization of competing queries from the initial decoder layer of SPFormer. The 3D bounding box represents the predicted instance. The blue color represents the ground truth, and the remaining colors (e.g., red, green, etc.) represent the prediction boxes. (b) Statistics of the average number of competing queries for SPFormer in the initial decoder layer under different numbers of queries and different IoU thresholds.
  • Figure 2: The overview of our pipeline. (a) shows the overall pipeline, in which the query competition layer (QCL) processes the input to derive spatial and competitive information, bifurcating into branches for static embedding-enhanced instance queries and dynamic self-attention weight adjustment via relative relationship encoding (RRE), followed by the rank cross attention (RCA) for query differentiation. The details of each module are shown in (b), (c), (d), respectively.
  • Figure 3: (a)&(b) Cumulative distribution of the class scores on the matched (unmatched) queries with or without QCL, (c) Cumulative distribution of the IoU scores on the matched (unmatched) queries with or without RCA.
  • Figure 4: Visualization of competing queries in the 1st/3rd/5th decoder layers for both Competitor-SPFormer (left 3 columns) and SPFormer (right 3 columns). 3D bounding boxes denote predicted instances. Only queries with IoU greater than $0.25$ are shown. The color of each fraction matches the corresponding bounding box's color, where the blue box is the ground truth box and the rest of the colored boxes (e.g., red, green, etc.) are the prediction boxes.
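
The statistic in Figure 1(b), the average number of competing queries at a given IoU threshold, could be computed roughly as sketched below. This is a minimal illustration, not the authors' code. It assumes that a query "competes" for an instance when the IoU between its predicted mask and the ground-truth mask exceeds the threshold, and that masks are given as sets of point indices. The function names `mask_iou` and `avg_competing_queries` and the toy data are hypothetical.

```python
from typing import List, Set


def mask_iou(a: Set[int], b: Set[int]) -> float:
    """IoU of two masks given as sets of point indices (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def avg_competing_queries(
    pred_masks: List[Set[int]],
    gt_masks: List[Set[int]],
    iou_thr: float = 0.25,
) -> float:
    """Average, over ground-truth instances, of the number of queries whose
    predicted mask overlaps that instance with IoU above iou_thr.

    Assumption: this per-instance count is what "competing queries" means here.
    """
    if not gt_masks:
        return 0.0
    counts = [
        sum(mask_iou(pred, gt) > iou_thr for pred in pred_masks)
        for gt in gt_masks
    ]
    return sum(counts) / len(counts)


# Toy scene (hypothetical): two instances, five queries.
gt = [set(range(0, 10)), set(range(20, 30))]
pred = [
    set(range(0, 10)),   # exact match for instance 0
    set(range(0, 8)),    # near-duplicate of instance 0
    set(range(2, 12)),   # shifted duplicate of instance 0
    set(range(20, 30)),  # exact match for instance 1
    set(range(40, 50)),  # background query, overlaps nothing
]

for thr in (0.25, 0.5, 0.75):
    print(thr, avg_competing_queries(pred, gt, thr))
```

In this toy scene, three queries overlap the first instance at the loose threshold, and raising the threshold prunes the weaker duplicates. This mirrors the idea that several queries can chase one instance while only one of them is matched and optimized.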