CV-MOS: A Cross-View Model for Motion Segmentation

Xiaoyu Tang, Zeyu Chen, Jintao Cheng, Xieyuanli Chen, Jin Wu, Bohuan Xue

TL;DR

CV-MOS tackles LiDAR MOS by introducing a cross-view, tri-branch architecture that fuses motion cues from RV and BEV residual maps while leveraging range-image semantics for guidance. The method includes a Cross-View Motion Branch, a dual-branch motion feature encoder, a GFNET-inspired fusion, and a Spatial-Channel Attention Module to mitigate information loss from projection. It achieves state-of-the-art IoU on SemanticKITTI-MOS (77.5% validation, 79.2% test) and shows strong generalization on Apollo, outperforming RV- and BEV-based baselines with improved efficiency. This cross-view strategy addresses occlusion, boundary blur, and distant-object sparsity, delivering robust MOS in dynamic driving scenarios.

Abstract

In autonomous driving, accurately distinguishing between static and moving objects is crucial. In the moving object segmentation (MOS) task, effectively leveraging the motion information of objects is a primary challenge in improving the recognition of moving objects. Previous methods used either range view (RV) or bird's eye view (BEV) residual maps to capture motion information. Unlike these approaches, we propose combining RV and BEV residual maps to jointly exploit more of the available motion information. Thus, we introduce CV-MOS, a cross-view model for moving object segmentation. As a novelty, we decouple spatial-temporal information by capturing motion from BEV and RV residual maps and generating semantic features from range images, which serve as moving-object guidance for the motion branch. Our direct solution makes fuller use of range images and of RV and BEV residual maps, significantly improving performance on the LiDAR-based MOS task. Our method achieved leading IoU scores of 77.5% and 79.2% on the validation and test sets of the SemanticKITTI dataset. In particular, CV-MOS demonstrates SOTA performance to date on various datasets. The CV-MOS implementation is available at https://github.com/SCNU-RISLAB/CV-MOS

Paper Structure (18 sections, 14 equations, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: The core idea of the proposed cross-view model. We prioritize motion information as the primary feature, with semantic information serving as a supplementary feature. To achieve this, we propose transitioning from a single-branch motion model to a dual-branch structure. This involves shifting the input source for motion information from a single-view projection to cross-view motion data, thereby enhancing the representation of motion information.
  • Figure 2: The problems with RV and BEV projection. RV projection suffers from occlusion and from distorted object proportions, so object aspect ratios become imbalanced after projection, as with the deformed car in the range image. BEV projection does not suffer from these issues. However, it introduces quantization error when dividing the space into voxels or pillars, which hurts distant objects that may contain only a few points, such as the small distant object in the bottom left corner of the picture.
  • Figure 3: CV-MOS is a cross-view network model with three inputs and two outputs. Range images along with RV and BEV residual maps are respectively fed into their designated branches. Both RV and BEV residual maps are jointly provided to the motion branch to facilitate the fusion of motion features. Initially, the RV motion branch outputs the first-stage prediction results. To achieve more refined segmentation, the final layer features of the RV motion branch are input into the SCAM, which then produces the final point cloud segmentation results.
  • Figure 4: Each range-image pixel first looks up the position index of its corresponding 3D point, then uses that index to extract the matching value from the BEV residual map, so that the RV and BEV residual feature maps are merged at the corresponding positions in the range image. Fusion is then performed in the subsequent attention fusion network.
  • Figure 5: The SCAM, through spatial and channel attention mechanisms, guides voxel blocks to retain crucial information while filtering out interfering information.
  • ...and 3 more figures
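
The projections discussed in Figure 2 and the index-based lookup described in Figure 4 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`rv_pixels`, `bev_cells`, `gather_bev_into_rv`), the Cartesian BEV grid, the 50 m extent, and the field-of-view values are all assumptions chosen for illustration, and the residual maps are treated as precomputed inputs.

```python
import numpy as np

def rv_pixels(points, H, W, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Spherical projection: (row, col) of each 3D point in an H x W range image.
    The field-of-view values are assumed, typical of a 64-beam sensor."""
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    col = np.floor(0.5 * (1.0 - yaw / np.pi) * W)
    row = np.floor((1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H)
    col = np.clip(col, 0, W - 1).astype(np.int64)
    row = np.clip(row, 0, H - 1).astype(np.int64)
    return row, col, r

def bev_cells(points, n, extent):
    """Quantize x/y into an n x n Cartesian grid covering [-extent, extent) metres.
    The grid layout is an assumption made for this sketch."""
    cell = 2.0 * extent / n
    ix = np.floor((points[:, 0] + extent) / cell).astype(np.int64)
    iy = np.floor((points[:, 1] + extent) / cell).astype(np.int64)
    valid = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)
    return ix, iy, valid

def gather_bev_into_rv(points, rv_residual, bev_residual, extent=50.0):
    """Stack the RV residual with BEV residual values gathered per range pixel."""
    H, W = rv_residual.shape
    n = bev_residual.shape[0]
    row, col, r = rv_pixels(points, H, W)
    ix, iy, valid = bev_cells(points, n, extent)

    # Index map: which 3D point each range pixel shows (-1 where empty).
    # Points are written far to near so the nearest point is written last.
    order = np.argsort(-r)
    idx_map = np.full((H, W), -1, dtype=np.int64)
    idx_map[row[order], col[order]] = order

    fused = np.zeros((2, H, W), dtype=np.float32)
    fused[0] = rv_residual
    hit = idx_map >= 0
    p = idx_map[hit]  # point index behind every occupied pixel
    bev_vals = bev_residual[np.clip(ix[p], 0, n - 1), np.clip(iy[p], 0, n - 1)]
    # Points outside the BEV grid contribute a zero residual.
    fused[1][hit] = np.where(valid[p], bev_vals, 0.0)
    return fused
```

The two-channel output places both residuals at the same range-image position, which is the alignment an attention-based fusion step would then operate on. Because several nearby points can share one BEV cell, the quantization error noted in Figure 2 shows up here as several range pixels reading the same BEV value.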