
CCTNet: A Circular Convolutional Transformer Network for LiDAR-based Place Recognition Handling Movable Objects Occlusion

Gang Wang, Chaoran Zhu, Qian Xu, Tongzhou Zhang, Hai Zhang, XiaoPeng Fan, Jue Hu

TL;DR

CCTNet introduces a Circular Convolution Module (CCM) and a Range Transformer Module (RTM) to enhance LiDAR-based place recognition under movable-object occlusion. It converts point clouds to yaw-angle-equivalent range images, expands horizontal receptive fields with circular convolution, and fuses channel and spatial cues to produce rotation-invariant descriptors, trained with an overlap-based regression loss. The approach achieves state-of-the-art Recall@1 and Recall@1% on KITTI and Ford Campus, and demonstrates strong generalization on a self-collected dataset under occlusion, while maintaining real-time performance. These contributions offer robust loop-closure detection in SLAM and relocalization tasks across diverse sensing setups and dynamic environments.
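A "yaw-angle-equivalent" range image gives every column the same yaw interval, so rotating the LiDAR about its vertical axis only shifts the image columns circularly. The minimal NumPy sketch below shows a standard spherical projection and checks that property. The function names `range_image` and `rotate_yaw`, the image size and the +3° to -25° vertical field of view are hypothetical illustration choices, not the paper's configuration.

```python
import numpy as np

def range_image(points, height=64, width=900, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud onto a height x width range image.

    Each column covers an equal yaw interval of 360/width degrees. The default
    sizes and field of view are hypothetical, not values from the paper.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                   # in (-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-8))
    up, down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    # Column from yaw (wraps around 360 deg), row from pitch (top row = fov_up).
    u = np.floor((yaw + np.pi) / (2 * np.pi) * width).astype(int) % width
    v = np.floor((up - pitch) / (up - down) * height).astype(int)
    v = np.clip(v, 0, height - 1)
    img = np.zeros((height, width), dtype=np.float32)
    # Write far-to-near so the closest return is kept when points share a pixel.
    order = np.argsort(-r)
    img[v[order], u[order]] = r[order]
    return img

def rotate_yaw(points, angle):
    """Rotate points about the vertical (z) axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

# One point at the centre of each of W yaw bins, with ranges 1..W.
W = 8
yaws = -np.pi + (np.arange(W) + 0.5) * 2 * np.pi / W
radii = 1.0 + np.arange(W)
pts = np.stack([np.cos(yaws) * radii, np.sin(yaws) * radii, np.zeros(W)], axis=1)

img = range_image(pts, height=4, width=W)
# Rotating by 3 column widths equals a circular shift of 3 columns.
img_rot = range_image(rotate_yaw(pts, 3 * 2 * np.pi / W), height=4, width=W)
print(np.allclose(img_rot, np.roll(img, 3, axis=1)))  # True
```

Because a viewpoint change becomes a pure column shift, any layer that is shift-equivariant along the columns, followed by pooling over columns, yields a yaw-invariant descriptor.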

Abstract

Place recognition is a fundamental task for robotic applications, allowing robots to perform loop closure detection within simultaneous localization and mapping (SLAM) and to achieve relocalization on prior maps. Current range image-based networks use single-column convolution to keep features invariant to the column shifts that LiDAR viewpoint changes cause in range images. However, this leads to "restricted receptive fields" and "excessive focus on local regions", which degrade network performance. To address these issues, we propose a lightweight circular convolutional Transformer network, CCTNet, which boosts performance by capturing structural information in point clouds and facilitating cross-dimensional interaction of spatial and channel information. First, a Circular Convolution Module (CCM) expands the network's receptive field while maintaining feature consistency across varying LiDAR perspectives. Then, a Range Transformer Module (RTM) improves place recognition accuracy in scenarios with movable objects by combining channel and spatial attention mechanisms. Furthermore, we propose an Overlap-based loss function that transforms place recognition from a binary loop closure classification into a regression problem on the overlap between LiDAR frames. In extensive experiments on the KITTI and Ford Campus datasets, CCTNet surpasses comparable methods, achieving Recall@1 of 0.924 and 0.965 and Recall@1% of 0.990 and 0.993 on the test sets, respectively. Results on a self-collected dataset further demonstrate the method's potential for practical deployment in complex scenarios with movable objects, as well as improved generalization across datasets.
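The abstract does not give the exact form of the Overlap-based loss, so the sketch below only illustrates the general idea of regressing a descriptor similarity onto the ground-truth overlap instead of a 0/1 loop label. It assumes L2-normalised descriptors, a cosine similarity mapped to [0, 1] and a mean squared error; the function name `overlap_regression_loss` and all of these choices are hypothetical, not the paper's formulation.

```python
import numpy as np

def overlap_regression_loss(query, refs, gt_overlap):
    """Hypothetical overlap-based regression loss (not the paper's exact formula).

    query:      (D,) L2-normalised descriptor of the query scan
    refs:       (M, D) L2-normalised descriptors of reference scans
    gt_overlap: (M,) ground-truth overlap ratios in [0, 1]
    """
    cos = refs @ query                  # cosine similarity in [-1, 1]
    pred = (cos + 1.0) / 2.0            # map the score to [0, 1]
    # Mean squared error against the overlap: a pair with 50% overlap gets a
    # graded target of 0.5 rather than being forced to "loop" or "no loop".
    return float(np.mean((pred - gt_overlap) ** 2))

q = np.array([1.0, 0.0])
refs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(overlap_regression_loss(q, refs, np.array([1.0, 0.5, 0.0])))  # 0.0
print(overlap_regression_loss(q, refs, np.zeros(3)))                # about 0.4167
```

In a real training loop the descriptors would come from the network and the loss would be minimised with an autodiff framework; NumPy is used here only to make the arithmetic visible.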

Paper Structure

This paper contains 19 sections, 11 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Query frame (blue) and reference frame (green) at adjacent locations with movable objects. 1) Occlusions at A), B), and C) typically span multiple pixel columns, and the limited receptive field of single-column convolution prevents the network from capturing global information. 2) Although both frames come from the same place, a single spatial attention mechanism assigns higher weights to the movable objects at A) than to the corresponding region a). Our circular convolutional Transformer network broadens the horizontal receptive field and facilitates cross-dimensional interaction of spatial and channel information.
  • Figure 2: The pipeline of the proposed method. The highlighted orange section comprises the main content of this paper.
  • Figure 3: Illustration of single-column convolution, traditional multi-column convolution, and circular convolution. (a) Because the receptive field is small, the kernel can hardly capture contextual information from the rest of the image. In (b) and (c), a LiDAR viewpoint change shifts the columns of the range image, so the zero-padding at both ends changes the extracted features. (d) Circular convolution treats the range image as a 360° panoramic image, which resolves the issues caused by viewpoint changes and keeps the features extracted from multiple columns consistent.
  • Figure 4: Illustration of circular convolution. The circular range image is first generated by expanding both ends of the range image, followed by performing convolution using kernels with a fixed stride.
  • Figure 5: RTM attention structure schematic. The channel attention of RTM consists of average pooling, 1D convolution, and a Sigmoid activation function. The spatial attention concatenates the outputs of average pooling and max pooling along the channel dimension. After 2D convolution and Sigmoid activation, the attention weight tensor is calculated, which is then used to reassign the feature map weights.
  • ...and 5 more figures
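Figures 3 and 4 describe circular convolution as extending both ends of the range image before convolving. The minimal NumPy sketch below, assuming a single channel, stride 1 and an odd kernel width (illustrative choices, not the CCM's actual configuration), checks the property Figure 3 relies on: with wrap-around padding a column shift of the input produces the same column shift of the output, whereas zero-padding breaks this, and max pooling over columns then gives a shift-invariant descriptor. The helper names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2D cross-correlation of a single-channel image x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def circular_conv(x, k):
    """Wrap columns from the opposite end onto each side, then convolve (stride 1)."""
    p = (k.shape[1] - 1) // 2  # keeps output width == input width for odd kernels
    x_wrapped = np.concatenate([x[:, -p:], x, x[:, :p]], axis=1) if p else x
    return conv2d_valid(x_wrapped, k)

def zero_padded_conv(x, k):
    """Same convolution, but with zeros added at both ends (np.pad default)."""
    p = (k.shape[1] - 1) // 2
    return conv2d_valid(np.pad(x, ((0, 0), (p, p))), k)

rng = np.random.default_rng(0)
img = rng.random((4, 16))     # toy range image: 4 rows, 16 yaw columns
kernel = rng.random((4, 5))   # full-height kernel spanning 5 columns
shift = 3                     # simulated yaw change of 3 columns

a = circular_conv(np.roll(img, shift, axis=1), kernel)
b = np.roll(circular_conv(img, kernel), shift, axis=1)
print(np.allclose(a, b))      # True: circular padding is shift-equivariant

c = zero_padded_conv(np.roll(img, shift, axis=1), kernel)
d = np.roll(zero_padded_conv(img, kernel), shift, axis=1)
print(np.allclose(c, d))      # False: zero-padding alters the border columns

desc = circular_conv(img, kernel).max(axis=1)
desc_shifted = circular_conv(np.roll(img, shift, axis=1), kernel).max(axis=1)
print(np.allclose(desc, desc_shifted))  # True: column max pooling is yaw-invariant
```

The explicit loops are only for readability; frameworks such as PyTorch expose the same wrap-around behaviour through circular padding modes.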
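Figure 5 lists the pieces of the RTM attention: channel attention built from average pooling, a 1D convolution and a Sigmoid, and spatial attention that concatenates channel-wise average and max maps before a 2D convolution and a Sigmoid. The NumPy sketch below wires these together with random, untrained weights. It assumes channel attention is applied before spatial attention, a 1D kernel of size 3, a 3x3 spatial kernel and zero "same" padding; these are hypothetical choices, and the function names are illustrative. The Transformer components of the RTM are not reproduced.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1d):
    """Average pool over H, W -> 1D conv across channels -> Sigmoid -> reweight channels."""
    pooled = feat.mean(axis=(1, 2))                          # (C,)
    k = len(w1d)                                             # odd kernel size assumed
    padded = np.pad(pooled, (k // 2, k // 2))                # zero 'same' padding
    mixed = np.array([padded[c:c + k] @ w1d for c in range(len(pooled))])
    return feat * sigmoid(mixed)[:, None, None]

def spatial_attention(feat, w2d):
    """Concat channel-wise avg and max maps -> 2D conv -> Sigmoid -> reweight pixels."""
    maps = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    kh, kw = w2d.shape[1:]
    padded = np.pad(maps, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    H, W = feat.shape[1:]
    logits = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            logits[i, j] = np.sum(padded[:, i:i + kh, j:j + kw] * w2d)
    return feat * sigmoid(logits)[None, :, :]

rng = np.random.default_rng(0)
feat = rng.random((8, 4, 16))                 # (channels, rows, columns)
out = spatial_attention(channel_attention(feat, rng.normal(size=3)),
                        rng.normal(size=(2, 3, 3)))
print(out.shape)                              # (8, 4, 16)
```

Both attention maps lie in (0, 1), so they only rescale the feature map; in a trained network these weights are what let the model down-weight regions such as the movable objects highlighted in Figure 1.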