Hierarchical Point Attention for Indoor 3D Object Detection

Manli Shu, Le Xue, Ning Yu, Roberto Martín-Martín, Caiming Xiong, Tom Goldstein, Juan Carlos Niebles, Ran Xu

TL;DR

This work tackles the challenge of imbalanced 3D object detection performance on indoor scenes by introducing hierarchical attention modules for point-based transformers. Aggregated Multi-Scale Attention (MS-A) generates higher-density multi-scale tokens from a single-scale input to enrich cross-attention, while Size-Adaptive Local Attention (Local-A) confines attention to adaptive regions inside bounding-box proposals. Both modules are model-agnostic and plug into existing transformer-based detectors, yielding consistent improvements on ScanNetV2 and SUN RGB-D, with the largest gains for small objects. By enabling fine-grained geometric and localized feature learning, the approach enhances indoor 3D perception and broadens the applicability of transformer-based detectors to cluttered environments.

Abstract

3D object detection is an essential vision technique for various robotic systems, such as augmented reality and domestic robots. Transformers, as versatile network architectures, have recently seen great success in 3D point cloud object detection. However, the lack of hierarchy in a plain transformer restrains its ability to learn features at different scales. This limitation makes transformer detectors perform worse on smaller objects and affects their reliability in indoor environments, where small objects are the majority. This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors. First, we propose Aggregated Multi-Scale Attention (MS-A), which builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning. Second, we propose Size-Adaptive Local Attention (Local-A), which uses adaptive attention regions for localized feature aggregation within bounding box proposals. Both attention operations are model-agnostic network modules that can be plugged into existing point cloud transformers for end-to-end training. We evaluate our method on two widely used indoor detection benchmarks. By plugging our proposed modules into state-of-the-art transformer-based 3D detectors, we improve the previous best results on both benchmarks, with more significant improvements on smaller objects.
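
The multi-scale token idea behind MS-A can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. It assumes the higher-density features are obtained by inverse-distance interpolation from the nearest coarse points, and that one attention head reads each scale before the heads are concatenated. The function names (`interpolate`, `attend`, `multi_scale_attention`), the choice of `k`, and the omission of learned projections are all illustrative assumptions.

```python
import math

def interpolate(fine_xyz, coarse_xyz, coarse_feats, k=3, eps=1e-8):
    """Give each fine point a feature by inverse-distance weighting of its
    k nearest coarse points (an assumed, non-learned upsampling step)."""
    dim = len(coarse_feats[0])
    out = []
    for p in fine_xyz:
        nearest = sorted((math.dist(p, c), i) for i, c in enumerate(coarse_xyz))[:k]
        weights = [1.0 / (d + eps) for d, _ in nearest]
        total = sum(weights)
        out.append([
            sum(w / total * coarse_feats[i][j] for w, (_, i) in zip(weights, nearest))
            for j in range(dim)
        ])
    return out

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]

def multi_scale_attention(query, coarse_feats, fine_feats):
    """Split channels into two heads: the first attends to the coarse
    tokens, the second to the higher-density tokens; then concatenate."""
    half = len(query) // 2
    coarse = [f[:half] for f in coarse_feats]
    fine = [f[half:] for f in fine_feats]
    return attend(query[:half], coarse, coarse) + attend(query[half:], fine, fine)

# Toy example: two coarse points upsampled to three denser points.
coarse_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
coarse_feats = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
fine_xyz = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
fine_feats = interpolate(fine_xyz, coarse_xyz, coarse_feats, k=2)
output = multi_scale_attention([0.0, 0.0, 0.0, 0.0], coarse_feats, fine_feats)
```

In a real detector the queries, keys, and values would pass through learned linear projections, and the denser point set would come from the raw point cloud rather than a hand-written list.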

Paper Structure (12 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: A point-based 3D transformer detector with our proposed modules (MS-A and Local-A). Detector overview: the raw point cloud is downsampled during encoding to obtain the point features, which serve as the keys and values in the transformer decoder. Object candidates are sampled from the point features (e.g., using FPS). The transformer decoder learns object features via alternating self- and cross-attention. The proposed MS-A and Local-A are cross-attention modules that can be plugged into the transformer. See Fig. 2 and Fig. 3 for the design details of each module.
  • Figure 2: Aggregated Multi-Scale Attention (MS-A) learns features at different scales within the multi-head cross-attention design. It constructs higher resolution (i.e., higher point density) point features from the single-scale input point features and uses keys and values of both scales.
  • Figure 3: Size-Adaptive Local Attention (Local-A) performs adaptive local attention between each object candidate (query) and the points inside its corresponding bounding box proposal. The attention range (the token lengths of keys and values) varies across object candidates, so we pad or truncate the tokens to allow batch processing.
  • Figure 4: Performance on objects of different sizes. We define the S/M/L thresholds based on each dataset's statistics (volume distribution).
  • Figure 5: Qualitative results on SUN RGB-D (top) and ScanNetV2 (bottom). The color of a bounding box in the middle two columns stands for the semantic label of the object. In the last column, we draw both the ground truth (in green) and the prediction (in blue) of the object, and we visualize the cross-attention weight of the last transformer layer (before applying Local-A) between an object candidate and the point cloud. We highlight the points that belong to an object for better visualization.
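
The padding and truncation step mentioned in the Figure 3 caption can be sketched in plain Python as follows. This is a hedged illustration, not the paper's code. It assumes axis-aligned boxes given as (center, full size), a fixed token budget `max_len`, truncation that simply keeps the first points found, and features used directly as keys and values without learned projections. All function names are hypothetical.

```python
import math

def points_in_box(points, center, size):
    """Indices of points inside an axis-aligned box (center, full size)."""
    return [
        i for i, p in enumerate(points)
        if all(abs(p[d] - center[d]) <= size[d] / 2 for d in range(3))
    ]

def pad_or_truncate(indices, max_len):
    """Fix the token length to max_len; return (indices, validity mask)."""
    kept = indices[:max_len]
    n_pad = max_len - len(kept)
    # Padded slots reuse index 0; the mask keeps them out of the softmax.
    return kept + [0] * n_pad, [True] * len(kept) + [False] * n_pad

def masked_attention(query, keys, values, mask):
    """Single-head scaled dot-product attention over valid tokens only."""
    if not any(mask):  # empty box: nothing to attend to
        return [0.0] * len(values[0])
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) if valid else -math.inf
        for key, valid in zip(keys, mask)
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
    total = sum(weights)
    return [
        sum(w / total * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]

def local_attention(queries, boxes, points, feats, max_len=4):
    """Each query attends only to the points inside its own box proposal."""
    outputs = []
    for query, (center, size) in zip(queries, boxes):
        idx, mask = pad_or_truncate(points_in_box(points, center, size), max_len)
        tokens = [feats[i] for i in idx]
        outputs.append(masked_attention(query, tokens, tokens, mask))
    return outputs

# Toy example: two points fall inside the first box, none inside the second.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
feats = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
boxes = [((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)), ((-5.0, -5.0, -5.0), (1.0, 1.0, 1.0))]
result = local_attention([[0.0, 0.0], [0.0, 0.0]], boxes, points, feats)
```

Because every query ends up with exactly `max_len` tokens plus a mask, the per-query loop could be replaced by a single batched attention call in a tensor library.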