StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory

Zhiheng Li, Yubo Cui, Jiexi Zhong, Zheng Fang

TL;DR

This work tackles online LiDAR-based moving object segmentation (MOS), where treating each inference independently leads to frame-to-frame inconsistency. It proposes a streaming architecture that maintains a short-term memory of features and a long-term memory of predictions. A multi-view feature encoder with cascaded projections and asymmetric convolution captures object motion in bird's-eye-view (BEV) and range-view (RV) representations, and a deformable-attention module fuses the memory features into the current inference. A two-stage training regime and a dual voting mechanism (voxel-based and instance-based) then refine predictions using historical context, improving temporal continuity and spatial integrity. Experiments on SemanticKITTI-MOS and Sipailou Campus show competitive IoU and near-real-time efficiency, supporting memory-driven streaming as an approach to MOS in autonomous systems.

Abstract

Moving object segmentation based on LiDAR is a crucial and challenging task for autonomous driving and mobile robotics. Most approaches explore spatio-temporal information from LiDAR sequences to predict moving objects in the current frame. However, they often focus on transferring temporal cues within a single inference and regard every prediction as independent of the others. This may cause inconsistent segmentation results for the same object in different frames. To overcome this issue, we propose a streaming network with a memory mechanism, called StreamMOS, to build the association of features and predictions among multiple inferences. Specifically, we utilize a short-term memory to convey historical features, which can be regarded as a spatial prior of moving objects and are adopted to enhance the current inference through temporal fusion. Meanwhile, we build a long-term memory to store previous predictions and exploit them to refine the current prediction at the voxel and instance levels through voting. In addition, we present a multi-view encoder with cascaded projection and asymmetric convolution to extract motion features of objects in different representations. Extensive experiments validate that our algorithm achieves competitive performance on the SemanticKITTI and Sipailou Campus datasets. Code will be released at https://github.com/NEU-REAL/StreamMOS.git.
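The streaming idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `DualSpanMemory`, the function `stream_step`, the window length `max_predictions`, and the four callables (`encode`, `fuse`, `decode`, `refine`) are hypothetical names introduced here, and storing the coarse (pre-voting) prediction in the long-term memory is an assumption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, Deque, List, Optional


@dataclass
class DualSpanMemory:
    """Hypothetical container: one short-term feature slot and a bounded
    history of past predictions (the long-term memory)."""
    max_predictions: int = 8  # assumed window length, not taken from the paper
    short_term: Optional[Any] = None
    long_term: Deque[Any] = field(default_factory=deque)

    def update(self, features: Any, prediction: Any) -> None:
        # Overwrite the feature slot and append the newest prediction,
        # dropping the oldest entries once the window is full.
        self.short_term = features
        self.long_term.append(prediction)
        while len(self.long_term) > self.max_predictions:
            self.long_term.popleft()


def stream_step(
    frame: Any,
    memory: DualSpanMemory,
    encode: Callable[[Any], Any],
    fuse: Callable[[Any, Any], Any],
    decode: Callable[[Any], Any],
    refine: Callable[[Any, List[Any]], Any],
) -> Any:
    """One inference of a streaming pipeline with two memories."""
    features = encode(frame)
    if memory.short_term is not None:
        # Temporal fusion: historical features act as a prior.
        features = fuse(features, memory.short_term)
    coarse = decode(features)
    # Refinement uses only predictions from earlier inferences.
    refined = refine(coarse, list(memory.long_term))
    # Assumption: the coarse prediction is what gets remembered.
    memory.update(features, coarse)
    return refined
```

The point of the sketch is the data flow: each call reads both memories before writing to them, so consecutive inferences are no longer independent of one another.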

Paper Structure (22 sections, 10 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Pipeline comparison of moving object segmentation approaches. We compare the structure of the proposed StreamMOS with previous methods in (a) and (b). The segmentation results obtained by our method achieve better spatial integrity and temporal continuity, as shown in (c).
  • Figure 2: The overall architecture of StreamMOS. (a) The feature encoder adopts a point-wise encoder to extract point features and project them into BEV. Then, the multi-view encoder with a cascaded structure and asymmetric convolution is applied to encode motion features from different views. (b) Temporal fusion utilizes an attention module to propagate memory features to the current inference. (c) The segmentation decoder with parameter-free upsampling exploits multi-scale features to predict class labels. (d) The voting mechanism leverages memory predictions to optimize the motion state of each 3D voxel and instance.
  • Figure 3: Illustration of asymmetric convolution and multi-view features.
  • Figure 4: The details of our voting mechanism. It uses voxel-based voting (VBV) and instance-based voting (IBV) to refine coarse predictions.
  • Figure 5: The visualization of MOS results on the SemanticKITTI validation set. Incorrect predictions are highlighted, with false negatives marked by green circles and false positives by blue circles. Best viewed in color and zoom.
  • ...and 3 more figures
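
The two voting stages named in the Figure 4 caption can be sketched as simple majority votes. This is an illustrative reading of the caption, not the paper's exact procedure. It assumes that all frames are already expressed in a shared world coordinate frame, that a plain majority decides each voxel and instance, and that ties keep the current label (voxel level) or fall back to static (instance level). The names `voxel_key`, `voxel_vote`, `instance_vote`, and the 0.5 voxel size are hypothetical.

```python
from collections import Counter, defaultdict

STATIC, MOVING = 0, 1


def voxel_key(point, voxel_size):
    # Map a 3D point to the integer index of the voxel that contains it.
    return tuple(int(c // voxel_size) for c in point)


def voxel_vote(frames, voxel_size=0.5):
    """Voxel-level vote. `frames` is a list of (points, labels) pairs in a
    shared coordinate frame; the last pair is the current frame. Returns
    refined labels for the points of the current frame."""
    votes = defaultdict(Counter)
    for points, labels in frames:
        for point, label in zip(points, labels):
            votes[voxel_key(point, voxel_size)][label] += 1

    cur_points, cur_labels = frames[-1]
    refined = []
    for point, label in zip(cur_points, cur_labels):
        counts = votes[voxel_key(point, voxel_size)]
        top = max(counts.values())
        winners = [lab for lab, n in counts.items() if n == top]
        # On a tie the current label is among the winners and is kept.
        refined.append(label if label in winners else winners[0])
    return refined


def instance_vote(labels, instance_ids):
    """Instance-level vote. Every point of an instance receives the
    instance's majority label; points with id None are left unchanged."""
    per_instance = defaultdict(Counter)
    for label, inst in zip(labels, instance_ids):
        if inst is not None:
            per_instance[inst][label] += 1

    refined = []
    for label, inst in zip(labels, instance_ids):
        if inst is None:
            refined.append(label)
            continue
        counts = per_instance[inst]
        # Assumption: ties resolve to static.
        refined.append(MOVING if counts[MOVING] > counts[STATIC] else STATIC)
    return refined
```

In this sketch the voxel vote addresses temporal continuity (a voxel keeps the label most of its history agrees on), while the instance vote addresses spatial integrity (an object is not split into moving and static parts).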