DeFlow: Decoder of Scene Flow Network in Autonomous Driving

Qingwen Zhang, Yi Yang, Heng Fang, Ruoyu Geng, Patric Jensfelt

TL;DR

DeFlow addresses real-time scene flow estimation on large-scale LiDAR point clouds by augmenting voxel-based encoders with a GRU-based decoder that reconstructs point-level features. A novel three-way, motion-aware loss balances static and dynamic points, improving accuracy for dynamic regions while maintaining efficiency. Empirical results on Argoverse 2 demonstrate state-of-the-art End Point Error and dynamic point metrics, with four GRU refinement iterations offering the best trade-off between performance and resources. The approach enables real-time deployment and sets the stage for self-supervised and multi-sensor fusion extensions in autonomous driving.

Abstract

Scene flow estimation determines a scene's 3D motion field by predicting the motion of points in the scene, which aids downstream tasks in autonomous driving. Many networks that take large-scale point clouds as input use voxelization to create a pseudo-image so that they can run in real time. However, the voxelization process often results in the loss of point-specific features, which makes it challenging to recover those features for scene flow tasks. Our paper introduces DeFlow, which enables a transition from voxel-based features to point features using Gated Recurrent Unit (GRU) refinement. To further enhance scene flow estimation performance, we formulate a novel loss function that accounts for the data imbalance between static and dynamic points. Evaluations on the Argoverse 2 scene flow task reveal that DeFlow achieves state-of-the-art results on large-scale point cloud data, demonstrating that our network has better performance and efficiency than other methods. The code is open-sourced at https://github.com/KTH-RPL/deflow.
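
To illustrate why a loss that separates static and dynamic points matters, here is a minimal sketch of a three-way, speed-bucketed End Point Error. It is not the authors' implementation: the function names and the speed thresholds (`slow`, `fast`, in m/s) are assumptions chosen for illustration, and only the 0.1 s frame interval comes from the 10 Hz data described in this document.

```python
import math


def three_way_epe_loss(pred_flow, gt_flow, frame_dt=0.1, slow=0.4, fast=1.0):
    """Sum of per-bucket mean end point errors (EPE).

    pred_flow, gt_flow: sequences of (x, y, z) displacements in metres per frame.
    slow, fast: ASSUMED speed thresholds in m/s, not taken from the text above.
    """
    buckets = ([], [], [])  # near-static, slow-moving, fast-moving
    for pred, gt in zip(pred_flow, gt_flow):
        epe = math.dist(pred, gt)            # Euclidean error of this point
        speed = math.hypot(*gt) / frame_dt   # ground-truth speed in m/s
        if speed < slow:
            index = 0
        elif speed <= fast:
            index = 1
        else:
            index = 2
        buckets[index].append(epe)
    # Each non-empty bucket contributes its own mean, so a small group of
    # dynamic points is not averaged away by many static points.
    return sum(sum(b) / len(b) for b in buckets if b)


def plain_mean_epe(pred_flow, gt_flow):
    """Unweighted mean EPE over all points, for comparison."""
    errors = [math.dist(p, g) for p, g in zip(pred_flow, gt_flow)]
    return sum(errors) / len(errors)


# 98 static points predicted perfectly, 2 fast points predicted at half speed.
gt = [(0.0, 0.0, 0.0)] * 98 + [(1.0, 0.0, 0.0)] * 2
pred = [(0.0, 0.0, 0.0)] * 98 + [(0.5, 0.0, 0.0)] * 2
print(plain_mean_epe(pred, gt), three_way_epe_loss(pred, gt))
```

In this toy case the plain mean is small because static points dominate, while the bucketed loss still exposes the 0.5 m error on the two moving points.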

Paper Structure (13 sections, 7 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: LiDAR scene flow estimation using our DeFlow method on Argoverse 2. The predicted scene flow for each point is color-coded by direction, with the color wheel anchored in the world frame. (a) Camera view, for visualization purposes only. (b)(c) Estimated flow of the LiDAR point cloud. Different colors represent different directions, and more saturated colors indicate higher velocities. (b) Front view. (c) Bird's-eye view.
  • Figure 2: Histogram of moving distances in 0.1 s for all dynamic points across all scenes in the Argoverse 2 validation dataset (10 Hz). The x-axis represents the distance in meters, ranging from 0.05 to 2.0 meters. The y-axis indicates the number of points for each distance range. The dynamic points are densely distributed within 0.2 meters.
  • Figure 3: DeFlow Architecture. The feature-extracting step, derived from PointPillars, takes two consecutive point clouds as input and transforms them into voxels. The encoder utilizes a convolutional U-Net backbone. Our novel decoder merges the encoder output with the point offset from PointPillars, employing a GRU for refinement. This process reconstructs the voxel-to-point information, ultimately producing the flow result.
  • Figure 4: Qualitative results from the validation dataset. The top row displays the ground truth flow, the middle row presents the FastFlow3D result, and the bottom row showcases the DeFlow outcomes. DeFlow estimates closely match the ground truth flow in both speed and angle. As highlighted in the two green circles, our DeFlow method demonstrates better performance in predicting motion angle (indicated by color variations) and speed (represented by color intensity) compared to FastFlow3D. The color wheel has been adjusted to align with the ego vehicle's forward direction.
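
The color coding described in the captions of Figures 1 and 4 (hue for direction, saturation for speed) can be sketched as follows. This is an illustrative mapping only, not the visualization code used for the figures: the function name `flow_to_rgb`, the use of the horizontal (x, y) flow components, and the saturation cap `max_speed` are all assumptions.

```python
import colorsys
import math


def flow_to_rgb(vx, vy, max_speed=2.0):
    """Map a horizontal flow vector to an RGB triple in [0, 1].

    Hue encodes the direction of motion; saturation grows with speed.
    max_speed is an ASSUMED magnitude at which saturation reaches 1.
    """
    angle = math.atan2(vy, vx) % (2.0 * math.pi)  # direction in [0, 2*pi)
    hue = angle / (2.0 * math.pi)                 # position on the color wheel
    saturation = min(math.hypot(vx, vy) / max_speed, 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)


print(flow_to_rgb(0.0, 0.0))  # static point: no saturation, rendered white
print(flow_to_rgb(2.0, 0.0))  # fast motion along +x: fully saturated
```

Anchoring the wheel in the world frame (Figure 1) or aligning it with the ego vehicle's forward direction (Figure 4) then amounts to choosing the coordinate frame in which `vx` and `vy` are expressed before the mapping is applied.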