SSRFlow: Semantic-aware Fusion with Spatial Temporal Re-embedding for Real-world Scene Flow
Zhiyang Lu, Qinghan Chen, Zhimin Yuan, Ming Cheng
TL;DR
SSRFlow tackles real-world scene flow estimation by addressing three problems: missing global semantic alignment between frames, spatiotemporal distortions introduced by warping, and the domain gap between synthetic and LiDAR-scanned data. It introduces three components: (1) Global Fusion Flow Embedding with Dual Cross Attentive fusion, which fuses and aligns semantic contexts across frames, (2) Spatial Temporal Re-embedding, which re-encodes spatiotemporal features after warping to refine the residual flow, and (3) Domain Adaptive Losses, which bridge the synthetic-to-real gap in motion inference. The method yields state-of-the-art results on FT3D, KITTI-based datasets, and LiDAR-centered benchmarks, with notable improvements in real-world scenarios and faster inference. These contributions improve robustness and generalization for downstream dynamic scene understanding in autonomous systems.
Abstract
Scene flow, the 3D motion field of the first frame estimated from two consecutive point clouds, is vital for dynamic scene perception. However, contemporary scene flow methods face three major challenges. First, they either lack global flow embedding or consider only the context of each point cloud individually before embedding, so embedded points struggle to perceive the consistent semantic relationships of the other frame. To address this issue, we propose Dual Cross Attentive (DCA) fusion, which performs latent fusion and alignment between the two frames based on semantic contexts. DCA is then integrated into Global Fusion Flow Embedding (GF) to initialize the flow embedding based on global correlations in both contextual and Euclidean spaces. Second, non-rigid objects are deformed after the warping layer, which distorts the spatiotemporal relations between consecutive frames. To estimate the residual flow at the next level more precisely, we devise the Spatial Temporal Re-embedding (STR) module, which updates the point sequence features at the current level. Lastly, generalization is often poor because of the significant domain gap between synthetic and LiDAR-scanned datasets. We introduce novel domain adaptive losses to bridge the gap in motion inference between synthetic and real-world data. Experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance across various datasets, with particularly strong results in real-world LiDAR-scanned scenes. Our code will be released upon publication.
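To make the idea of fusing semantic context across two frames concrete, the following is a minimal sketch of generic bidirectional cross-attention between two sets of per-point features. It is not the authors' DCA module: the function names (`cross_attend`, `dual_cross_fusion`), the absence of learned projections, the scaled dot-product scoring, and the residual addition are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attend(query_feats, context_feats):
    """Let each query point gather features from every context point.

    query_feats:   (N, C) features of one frame
    context_feats: (M, C) features of the other frame
    Returns the attended features (N, C) and the attention weights (N, M).
    Hypothetical simplification: no learned query/key/value projections.
    """
    channels = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(channels)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ context_feats, weights


def dual_cross_fusion(feats_1, feats_2):
    """Attend in both directions and add the result residually (assumed design)."""
    ctx_1, _ = cross_attend(feats_1, feats_2)  # frame 1 looks at frame 2
    ctx_2, _ = cross_attend(feats_2, feats_1)  # frame 2 looks at frame 1
    return feats_1 + ctx_1, feats_2 + ctx_2


# Toy data: 5 points in frame 1, 7 points in frame 2, 8 feature channels.
rng = np.random.default_rng(0)
feats_1 = rng.normal(size=(5, 8))
feats_2 = rng.normal(size=(7, 8))
fused_1, fused_2 = dual_cross_fusion(feats_1, feats_2)
print(fused_1.shape, fused_2.shape)
```

Because every point attends to all points of the other frame, each fused feature carries global context from that frame even when the two clouds have different numbers of points. A real module would add learned projections and, per the abstract, correlations in Euclidean space as well.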
