MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model

Kang Zeng, Hao Shi, Jiacheng Lin, Siyu Li, Jintao Cheng, Kaiwei Wang, Zhiyong Li, Kailun Yang

TL;DR

This work tackles LiDAR MOS by addressing the weak coupling between temporal and spatial cues. It introduces TCBE to amplify temporal dominance and MSSM to enable deep, cross-scan motion-spatial interactions within a U-Net framework, leveraging 4D point clouds serialized via space-filling curves. Together, these components yield state-of-the-art MOS performance on SemanticKITTI-MOS and KITTI-Road, demonstrating strong generalization and robustness. The study also marks the first application of a State Space Model to MOS, opening avenues for efficient long-range temporal modeling in dynamic scene understanding.

Abstract

LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment moving objects in point clouds of the current scan using motion information from previous scans. Despite the promising results achieved by previous MOS methods, several key issues, such as the weak coupling of temporal and spatial information, still need further study. In this paper, we propose a novel LiDAR-based 3D moving object segmentation method built on a motion-aware state space model, termed MambaMOS. Firstly, we develop a novel embedding module, the Time Clue Bootstrapping Embedding (TCBE), to enhance the coupling of temporal and spatial information in point clouds and alleviate the issue of overlooked temporal clues. Secondly, we introduce the Motion-aware State Space Model (MSSM) to endow the model with the capacity to understand the temporal correlations of the same object across different time steps. Specifically, MSSM emphasizes the motion states of the same object at different time steps through two distinct temporal modeling and correlation steps. We utilize an improved state space model to represent these motion differences and thereby model the motion states effectively. Finally, extensive experiments on the SemanticKITTI-MOS and KITTI-Road benchmarks demonstrate that the proposed MambaMOS achieves state-of-the-art performance. The source code is publicly available at https://github.com/Terminal-K/MambaMOS.

Paper Structure (16 sections, 16 equations, 3 figures, 6 tables)


Figures (3)

  • Figure 1: A brief comparison of other non-projection methods (sub-figure (a)) with ours (sub-figure (b)). Prior methods treat temporal information $t$ and spatial occupancy information $O$ equally. In contrast, our method gives temporal information greater weight through the proposed TCBE and couples temporal and spatial information more deeply with MSSM, which aligns more closely with the fundamental principles of motion recognition.
  • Figure 2: Overview of the proposed MambaMOS. The previous $F-1$ scans are transformed to the viewpoint of the current scan and overlaid with it to form a 4D point cloud. This 4D point cloud is then serialized to obtain a sequence as input. TCBE strengthens the coupling between temporal and spatial information in this input, and the result is fed into a symmetric encoder-decoder architecture (the pink box). Each stage of the encoder/decoder consists of a pooling/unpooling layer and $N$ blocks (the blue box). MSSM serves as the core of each block to achieve deep-level coupling of temporal and spatial features. Finally, a linear layer applied to the decoder output produces the MOS result for the current scan.
  • Figure 3: Visual comparison of MambaMOS with MF-MOS, InsMOS, and 4DMOS on the SemanticKITTI validation set. We overlay the predictions for the current scan and the past seven scans to visually demonstrate the results.
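
The input preparation described in the Figure 2 caption (viewpoint transformation, overlaying scans into a 4D point cloud, and serialization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the sensor-to-world pose convention, the use of a scan-age index as the time coordinate, the voxel size, and the choice of a Z-order (Morton) curve as the space-filling curve are all assumptions made for illustration; the source only states that the 4D point cloud is serialized via space-filling curves.

```python
import numpy as np

def to_current_frame(points, pose_prev, pose_curr):
    """Map N x 3 points from a past scan's sensor frame into the current frame.
    Assumption: poses are 4x4 sensor-to-world matrices."""
    rel = np.linalg.inv(pose_curr) @ pose_prev
    homo = np.hstack([points, np.ones((len(points), 1))])
    return (homo @ rel.T)[:, :3]

def build_4d_cloud(scans, poses):
    """Overlay F scans (oldest first, current last) into one M x 4 array.
    Columns are x, y, z, t; t = 0 marks the current scan and t = k marks
    the scan k steps in the past (hypothetical time encoding)."""
    pose_curr = poses[-1]
    chunks = []
    for age, (pts, pose) in enumerate(zip(reversed(scans), reversed(poses))):
        xyz = to_current_frame(pts, pose, pose_curr)
        t = np.full((len(xyz), 1), float(age))
        chunks.append(np.hstack([xyz, t]))
    return np.vstack(chunks)

def morton_code(grid, bits=10):
    """Interleave the low `bits` bits of non-negative integer grid
    coordinates (N x 3) into Z-order keys."""
    grid = grid.astype(np.uint64)
    code = np.zeros(len(grid), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            bit = (grid[:, axis] >> np.uint64(b)) & np.uint64(1)
            code |= bit << np.uint64(3 * b + axis)
    return code

def serialize(cloud4d, voxel=0.5, bits=10):
    """Order the 4D points along a Z-order curve over their voxelized x, y, z.
    The stable sort keeps points that share a voxel in their input order,
    so co-located points from different scans stay adjacent in the sequence."""
    xyz = cloud4d[:, :3]
    grid = np.floor((xyz - xyz.min(axis=0)) / voxel).astype(np.int64)
    grid = np.clip(grid, 0, 2**bits - 1)
    order = np.argsort(morton_code(grid, bits), kind="stable")
    return cloud4d[order]
```

Calling `serialize(build_4d_cloud(scans, poses))` returns a single sequence in which spatially nearby points from different time steps tend to lie close together, which is the property a sequence model needs in order to relate observations of the same object across scans.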