Scale Disparity of Instances in Interactive Point Cloud Segmentation

Chenrui Han, Xuan Yu, Yuxuan Xie, Yili Liu, Sitong Mao, Shunbo Zhou, Rong Xiong, Yue Wang

TL;DR

ClickFormer addresses the scale disparity between thing and stuff instances in interactive point cloud segmentation. It combines a query augmentation module, driven by a global query sampling strategy, with a bidirectional query-voxel transformer that uses global attention. The method segments instances across very different scales while reducing user effort: it outperforms baselines on outdoor and indoor datasets and generalizes to open-world settings with fewer prompts. Key contributions are scale-invariant augmentation queries, two-way attention that propagates prompts to voxel embeddings, and ablations confirming the effect of query augmentation and global attention. Results are reported as $mIoU@k$ across the datasets.
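The summary above reports results as $mIoU@k$ without defining it. A minimal sketch follows, assuming the reading common in interactive segmentation: the IoU of each instance's predicted mask after $k$ clicks, averaged over instances. The function names, the list-of-booleans mask format, and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Sequence


def iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection over union of two per-point boolean masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def miou_at_k(preds_per_instance: List[List[Sequence[bool]]],
              targets: List[Sequence[bool]], k: int) -> float:
    """Mean IoU over instances, using each instance's mask after k clicks.

    preds_per_instance[i][j] is the predicted mask for instance i
    after j + 1 clicks (hypothetical data layout).
    """
    scores = [iou(preds[k - 1], target)
              for preds, target in zip(preds_per_instance, targets)]
    return sum(scores) / len(scores)


# Toy data: two instances over four points, masks after 1 and 2 clicks.
targets = [
    [True, True, False, False],
    [False, True, True, True],
]
preds_per_instance = [
    [[True, False, False, False], [True, True, False, False]],
    [[False, True, True, False], [False, True, True, True]],
]

miou_1 = miou_at_k(preds_per_instance, targets, k=1)  # (1/2 + 2/3) / 2
miou_2 = miou_at_k(preds_per_instance, targets, k=2)  # both masks exact
```

In this toy case the score rises from 7/12 after one click to 1.0 after two, which is the kind of per-click improvement the metric is meant to capture.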

Abstract

Interactive point cloud segmentation has become a pivotal task for understanding 3D scenes, enabling users to guide segmentation models with simple interactions such as clicks, thereby significantly reducing the effort required to tailor models to diverse scenarios and new categories. However, in interactive segmentation the meaning of instance diverges from that in instance segmentation, because users may want to segment instances of both thing and stuff categories, which vary greatly in scale. Existing methods have focused on thing categories, neglecting the segmentation of stuff categories and the difficulties arising from scale disparity. To bridge this gap, we propose ClickFormer, an interactive point cloud segmentation model that accurately segments instances of both thing and stuff categories. We propose a query augmentation module that augments click queries through a global query sampling strategy, thus maintaining consistent performance across different instance scales. Additionally, we employ global attention in the query-voxel transformer to mitigate the risk of generating false positives, along with several other network structure improvements that further enhance the model's segmentation performance. Experiments demonstrate that ClickFormer outperforms existing interactive point cloud segmentation methods on both indoor and outdoor datasets, providing more accurate segmentation results with fewer user clicks in an open-world setting.
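The abstract says the query-voxel transformer uses global attention but does not describe the local scheme it replaces. The sketch below is one generic way to picture the difference, under stated assumptions: a single query computes scaled dot-product softmax weights over voxel keys, either over all voxels (global) or only over voxels within a radius of a center point (local). The radius mask, the function name, and the toy vectors are hypothetical and are not taken from ClickFormer.

```python
import math
from typing import List, Optional, Sequence


def attention_weights(query: Sequence[float],
                      voxel_keys: List[Sequence[float]],
                      voxel_xyz: Optional[List[Sequence[float]]] = None,
                      center: Optional[Sequence[float]] = None,
                      radius: Optional[float] = None) -> List[float]:
    """Softmax attention of one query over voxel keys.

    radius=None: every voxel is attended to (global attention).
    radius set:  voxels farther than `radius` from `center` are masked
                 out before the softmax (an assumed form of local attention).
    """
    scale = math.sqrt(len(query))
    scores = []
    for i, key in enumerate(voxel_keys):
        if radius is not None and math.dist(voxel_xyz[i], center) > radius:
            scores.append(-math.inf)  # masked voxel gets zero weight
        else:
            scores.append(sum(q * k for q, k in zip(query, key)) / scale)
    top = max(scores)
    if top == -math.inf:
        raise ValueError("no voxel lies within the attention radius")
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Toy scene: voxels 0 and 1 have identical features, but voxel 1 is far away.
query = [1.0, 0.0]
voxel_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
voxel_xyz = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]

global_w = attention_weights(query, voxel_keys)
local_w = attention_weights(query, voxel_keys, voxel_xyz,
                            center=(0.0, 0.0, 0.0), radius=1.0)
```

With the mask, the distant voxel receives zero weight even though its features match the query; without it, the query sees the whole scene. This only illustrates the mechanism, and the claim that global attention reduces false positives rests on the paper's experiments, not on this sketch.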

Paper Structure

This paper contains 17 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Scale Disparity in Interactive Segmentation. Existing methods suffer from scale disparity when segmenting stuff categories.
  • Figure 2: Network Architecture. The overall network consists of 1) a feature encoder, 2) a query augmentation module and 3) a mask decoder composed of a query-voxel transformer and a mask segmentation module.
  • Figure 3: Attention Map of Click Queries and Augmentation Queries. Red points indicate positive clicks, and green points indicate negative clicks. For the point cloud segmentation results, blue represents true positives, pink represents false positives, yellow represents false negatives, and gray represents true negatives. All visualization results follow the same color scheme.
  • Figure 4: Comparison of Global Attention (top) and Local Attention (bottom): Attention Maps and Segmentation Results.
  • Figure 5: Qualitative Results on Outdoor Datasets. Red points indicate positive clicks, and green points indicate negative clicks. For the point cloud segmentation results, blue represents true positives, pink represents false positives, yellow represents false negatives, and gray represents true negatives. All visualization results follow the same color scheme.
  • ...and 2 more figures
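The color scheme described in the captions for Figures 3 and 5 can be written as a small per-point mapping. This is only an illustrative sketch: the function names and the boolean-mask inputs are assumptions, and the color names are taken from the captions rather than from any released visualization code.

```python
from typing import List, Sequence

# Outcome-to-color mapping as stated in the figure captions.
COLORS = {"TP": "blue", "FP": "pink", "FN": "yellow", "TN": "gray"}


def point_outcome(pred: bool, target: bool) -> str:
    """Classify one point as true/false positive/negative."""
    if pred and target:
        return "TP"
    if pred:
        return "FP"  # predicted as the instance, but not part of it
    if target:
        return "FN"  # part of the instance, but missed
    return "TN"


def colorize(pred_mask: Sequence[bool], target_mask: Sequence[bool]) -> List[str]:
    """Return one color name per point for a predicted and a ground-truth mask."""
    return [COLORS[point_outcome(p, t)] for p, t in zip(pred_mask, target_mask)]


example_colors = colorize([True, True, False, False],
                          [True, False, True, False])
```

Positive and negative clicks (red and green in the captions) are drawn separately from this per-point coloring and are not modeled here.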