
SSRFlow: Semantic-aware Fusion with Spatial Temporal Re-embedding for Real-world Scene Flow

Zhiyang Lu, Qinghan Chen, Zhimin Yuan, Ming Cheng

TL;DR

SSRFlow tackles real-world scene flow estimation by addressing three problems: global semantic alignment, spatiotemporal distortions after warping, and the domain gap between synthetic and LiDAR data. It introduces a three-pronged architecture: (1) Global Fusion Flow Embedding with Dual Cross Attentive fusion to fuse and align semantic contexts across frames, (2) Spatial Temporal Re-embedding to re-encode spatiotemporal features after warping for refined residual flow, and (3) Domain Adaptive Losses to bridge synthetic-to-real motion inference. The method yields state-of-the-art results on FT3D, KITTI-based datasets, and LiDAR-centered benchmarks, with notable improvements in real-world scenarios and faster inference. These contributions enhance robustness and generalization for downstream dynamic scene understanding in autonomous systems.

Abstract

Scene flow, which provides the 3D motion field of the first frame from two consecutive point clouds, is vital for dynamic scene perception. However, contemporary scene flow methods face three major challenges. First, they either lack global flow embedding or consider only the context of each individual point cloud before embedding, so embedded points struggle to perceive consistent semantic relationships with the other frame. To address this issue, we propose Dual Cross Attentive (DCA) fusion, which performs latent fusion and alignment between the two frames based on semantic contexts. DCA is then integrated into Global Fusion Flow Embedding (GF) to initialize the flow embedding from global correlations in both contextual and Euclidean spaces. Second, non-rigid objects are deformed after the warping layer, which distorts the spatiotemporal relation between consecutive frames. To estimate the residual flow at the next level more precisely, the Spatial Temporal Re-embedding (STR) module updates the point sequence features at the current level. Lastly, generalization is often poor because of the significant domain gap between synthetic and LiDAR-scanned datasets. We introduce novel domain adaptive losses to bridge the gap in motion inference between synthetic and real-world data. Experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance across various datasets, with particularly strong results in real-world LiDAR-scanned scenes. Our code will be released upon publication.
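
To make the idea of fusing two frames "based on semantic contexts" more concrete, the following is a minimal sketch of generic bidirectional cross-attention between per-point features of a source and a target frame. It is not the authors' DCA module: the abstract does not specify its layers, so the function names (`cross_attend`, `dual_cross_fusion`), the plain scaled dot-product scoring without learned projections, and the concatenation-based fusion are all assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feats, context_feats):
    """Hypothetical helper: each query point aggregates features of the
    other frame, weighted by scaled dot-product feature similarity."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (Nq, Nc)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ context_feats                       # (Nq, d)

def dual_cross_fusion(feat_src, feat_tgt):
    """Hypothetical helper: attend in both directions, then concatenate each
    frame's own features with the context gathered from the other frame."""
    fused_src = np.concatenate([feat_src, cross_attend(feat_src, feat_tgt)], axis=-1)
    fused_tgt = np.concatenate([feat_tgt, cross_attend(feat_tgt, feat_src)], axis=-1)
    return fused_src, fused_tgt

# Toy input (assumed sizes): 5 source points and 7 target points, 8-dim features.
rng = np.random.default_rng(0)
feat_src = rng.normal(size=(5, 8))
feat_tgt = rng.normal(size=(7, 8))
fused_src, fused_tgt = dual_cross_fusion(feat_src, feat_tgt)
print(fused_src.shape, fused_tgt.shape)
```

The two frames may contain different numbers of points, because attention is computed over all pairs rather than over a fixed neighborhood. That all-pairs scoring is the sense in which such a fusion is global.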

Paper Structure (29 sections, 18 equations, 13 figures, 9 tables)


Figures (13)

  • Figure 1: Illustration of the proposed network. Firstly, semantic features are hierarchically extracted and sent to GF to achieve global embedding between the two point clouds at the highest level. Then, the Flow Prediction (FP) module produces the initial scene flow. Subsequently, the flow and features are upsampled level by level, and the upsampled flow is accumulated onto the source frame by the warping layer. Afterwards, Spatial Temporal Re-embedding (STR) and Local Flow Embedding (LFE) are performed in turn, and FP yields the refined flow at a specific level.
  • Figure 2: Flowchart of global flow embedding. $\otimes$ and $\oplus$ denote multiplication and concatenation, respectively.
  • Figure 3: Details of the STR module.
  • Figure 4: Comparisons of scene flow datasets, including (a) synthetic stereo, (b) real-world stereo, and (c)(d) real-world LiDAR-scanned. Blue and purple denote the source and target frames, respectively.
  • Figure 5: Visual comparisons on KITTI$_s$ (first row) and FT3D$_s$ (second row). Blue points are the source frame, and green points are the source frame warped by the predicted flow. Red points are incorrectly predicted warped points whose EPE3D $>$ 0.1 m.
  • ...and 8 more figures
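
The warping step in the Figure 1 caption and the error threshold in the Figure 5 caption can be illustrated with a small sketch. It assumes the usual conventions: warping adds the predicted flow to the source points, and EPE3D is the per-point Euclidean distance between predicted and ground-truth flow, in meters. The function names and the toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def warp(source_points, flow):
    # Move each source point by its predicted 3D flow vector.
    return source_points + flow

def epe3d(pred_flow, gt_flow):
    # Per-point 3D end-point error: L2 distance between the two flow vectors.
    return np.linalg.norm(pred_flow - gt_flow, axis=-1)

# Toy data (assumed): three source points with ground-truth flow, in meters.
source = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
gt_flow = np.array([[0.5, 0.0, 0.0],
                    [0.0, 0.5, 0.0],
                    [0.0, 0.0, 0.5]])
# The prediction is off by 0.05 m, 0.2 m, and 0 m respectively.
pred_flow = gt_flow + np.array([[0.05, 0.0, 0.0],
                                [0.0, 0.2, 0.0],
                                [0.0, 0.0, 0.0]])

warped = warp(source, pred_flow)   # points that would be drawn as the warped frame
errors = epe3d(pred_flow, gt_flow)
wrong = errors > 0.1               # points that would be flagged as incorrect

print(errors)
print(wrong)
```

Only the second point exceeds the 0.1 m threshold here, so it is the only one that a Figure 5 style visualization would flag as incorrect.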