MambaFlow: A Novel and Flow-guided State Space Model for Scene Flow Estimation

Jiehao Luo, Jintao Cheng, Xiaoyu Tang, Qingwen Zhang, Bohuan Xue, Rui Fan

TL;DR

MambaFlow is a scene flow estimation network with a Mamba-based decoder. Its backbone enables deep interaction and coupling of spatio-temporal features, and a scene-adaptive loss function automatically adapts to different motion patterns.

Abstract

Scene flow estimation aims to predict 3D motion from consecutive point cloud frames, which is of great interest in the autonomous driving field. Existing methods face challenges such as insufficient spatio-temporal modeling and an inherent loss of fine-grained features during voxelization. The success of Mamba, a representative state space model (SSM) that enables global modeling with linear complexity, provides a promising solution. In this paper, we propose MambaFlow, a novel scene flow estimation network with a Mamba-based decoder. It enables deep interaction and coupling of spatio-temporal features using a well-designed backbone. We steer the global attention modeling of voxel-based features with point offset information using an efficient Mamba-based decoder, learning voxel-to-point patterns that are used to devoxelize shared voxel representations into point-wise features. To further enhance the model's generalization capabilities across diverse scenarios, we propose a novel scene-adaptive loss function that automatically adapts to different motion patterns. Extensive experiments on the Argoverse 2 benchmark demonstrate that MambaFlow achieves state-of-the-art performance among existing works at real-time inference speed, enabling accurate flow estimation in real-world urban scenarios. The code is available at https://github.com/SCNU-RISLAB/MambaFlow.

Paper Structure

This paper contains 28 sections, 18 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of voxelization and devoxelization for scene flow estimation. (a) Previous methods typically use two consecutive frames with coarse devoxelization that assigns identical features to points in the same voxel, causing inherent feature loss. (b) Our MambaFlow leverages N consecutive scans for richer temporal information, with refined devoxelization that learns distinct voxel-to-point patterns for fine-grained feature representation.
  • Figure 2: Overall architecture of MambaFlow. The network first voxelizes and encodes five consecutive scans, forming 4D features by concatenating 3D voxel representations along the temporal dimension. These features are processed by our spatio-temporal coupling network for multi-scale feature learning. The decoder then learns voxel-to-point patterns through cascaded FlowSSM layers, enabling point-wise feature differentiation within the same voxel and generating the scene flow through an MLP layer.
  • Figure 3: Architecture of the Spatio-temporal Deep Coupling Block. (a) The baseline Spatio-temporal Decomposition Block [kim2024flow4d] processes features through repeated convolutions at each stage. (b) Our proposed Spatio-temporal Deep Coupling Block achieves more efficient feature extraction by removing redundant convolution modules and introducing a cross-timestep branch. Panels (c)-(e) on the right show the detailed structures of the Soft Feature Selection Mechanism and the gating mechanisms used in the Spatio-temporal Deep Coupling Block.
  • Figure 4: MambaFlow Decoder architecture. Point-wise features and voxel features are first serialized through Z-order space-filling curves for spatial proximity preservation. The FlowSSM module consists of N cascaded FlowSSM layers, where point offset features guide the learning of voxel-to-point patterns in each layer for refined feature reconstruction. The final output is obtained through deserialization and feature fusion with point offset information.
  • Figure 5: Qualitative results on the Argoverse 2 validation set. From left to right: Ground Truth, DeFlow, Flow4D, and our proposed MambaFlow. The color legend indicates both speed (shown by color intensity) and motion angle (2D), aligned with the vehicle's forward direction. The highlighted regions (yellow circles) demonstrate our method's superior performance in capturing both static and dynamic object motions, especially for challenging cases with complex motion patterns.
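
The captions for Figures 1 and 4 mention three generic operations: voxelization, coarse devoxelization (every point in a voxel receives the same feature), and serialization along a Z-order space-filling curve. The following is a minimal NumPy sketch of those operations only, not the authors' implementation. The function names, the 0.5 voxel size, the 4-channel random features, and the 10-bits-per-axis Morton encoding are all assumptions made for illustration.

```python
import numpy as np


def voxelize(points, voxel_size):
    """Map each point to an integer voxel index and its offset from the voxel center."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    centers = (coords + 0.5) * voxel_size
    return coords, points - centers


def part1by2(x):
    """Spread the low 10 bits of x so that two zero bits separate consecutive bits."""
    x = x & 0x3FF
    x = (x | (x << 16)) & 0xFF0000FF
    x = (x | (x << 8)) & 0x0300F00F
    x = (x | (x << 4)) & 0x030C30C3
    x = (x | (x << 2)) & 0x09249249
    return x


def morton3d(coords):
    """Interleave the bits of non-negative (x, y, z) voxel indices into Z-order keys."""
    return (part1by2(coords[:, 0])
            | (part1by2(coords[:, 1]) << 1)
            | (part1by2(coords[:, 2]) << 2))


# Toy point cloud (hypothetical values): the first two points share a voxel.
points = np.array([[0.1, 0.1, 0.1],
                   [0.3, 0.2, 0.1],
                   [1.2, 0.1, 0.1],
                   [0.1, 1.3, 0.1]])
coords, offsets = voxelize(points, voxel_size=0.5)

# Shift indices to be non-negative before encoding (assumes < 1024 voxels per axis).
keys = morton3d(coords - coords.min(axis=0))

# np.unique sorts the keys, so occupied voxels come out already in Z-order.
voxel_keys, point_to_voxel = np.unique(keys, return_inverse=True)

# Stand-in voxel features; a real network would produce these with its backbone.
rng = np.random.default_rng(0)
voxel_feats = rng.normal(size=(len(voxel_keys), 4))

# Coarse devoxelization: a plain gather, so co-voxel points get identical features.
point_feats = voxel_feats[point_to_voxel]

# Z-order serialization of the points themselves.
order = np.argsort(keys, kind="stable")
```

In this toy example the first two points receive identical features after the gather, while their offsets from the voxel center differ. That per-point offset is the kind of signal the Figure 4 caption describes as guiding the learning of voxel-to-point patterns; how MambaFlow actually consumes it is not shown here.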