Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation

David T. Hoffmann, Syed Haseeb Raza, Hanqiu Jiang, Denis Tananaev, Steffen Klingenhoefer, Martin Meinke

TL;DR

The paper addresses unsupervised scene flow estimation from LiDAR, aiming to reduce the heavy runtime of test-time optimization while maintaining high accuracy. It replaces the usual MLP-based implicit representation with a simple 3D voxel grid and couples it with a multi-frame distance-transform loss, a cluster-consistency constraint, and a flow-norm regularizer to handle occlusions, misassociations, and static regions. Floxels performs competitively with EulerFlow while running 60–140× faster, and it outperforms fast baselines and some supervised methods on dynamic points across benchmarks. The approach improves robustness to occlusion and domain variations and scales efficiently to large point clouds, making unsupervised scene flow more practical for robotic applications.

Abstract

Scene flow estimation is a foundational task for many robotic applications, including robust dynamic object detection, automatic labeling, and sensor synchronization. Two types of approaches to the problem have evolved: 1) supervised and 2) optimization-based methods. Supervised methods are fast during inference and achieve high-quality results; however, they are limited by the need for large amounts of labeled training data and are susceptible to domain gaps. In contrast, unsupervised test-time optimization methods do not face the problem of domain gaps but usually suffer from substantial runtime, exhibit artifacts, or fail to converge to the right solution. In this work, we mitigate several limitations of existing optimization-based methods. To this end, we 1) introduce a simple voxel grid-based model that improves over the standard MLP-based formulation in multiple dimensions, 2) introduce a new multiframe loss formulation, and 3) combine both contributions in our new method, termed Floxels. On the Argoverse 2 benchmark, Floxels is surpassed only by EulerFlow among unsupervised methods while achieving comparable performance at a fraction of the computational cost. Floxels achieves a speedup of roughly 60–140× over EulerFlow, reducing the runtime from a day to 10 minutes per sequence. Over the faster but low-quality baseline, NSFP, Floxels achieves a speedup of ~14×.

Paper Structure

This paper contains 25 sections, 5 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Examples of scene flow from Fast Neural Scene Flow (FNSF) [li2023fast] (top) and our method Floxels (bottom). The near-points, opposite-direction, and occlusion panels show point cloud $t_1$ (purple), point cloud $t_2$ (blue), and the estimated scene flow (orange). Near-points panel: FNSF tends to predict flow to the closest points. Floxels uses neighboring points to escape such local minima. Occlusion panel: A static wall occluded by trees. FNSF predicts flow in the occluded region (red circles) to the nearest points. Floxels overcomes this by making use of multiple scans. Flow-field panel: Bird's-eye view of the flow field. FNSF displays wrong flow patterns in regions without objects. Floxels correctly predicts zero flow for these regions.
  • Figure 2: Floxels components. Voxel grid: Instead of an MLP, we use a simple grid to represent the motion of the points. In each voxel (here depicted as a vertex), we learn the x, y, and z velocities. For each point (blue), the flow is calculated via trilinear interpolation from the neighboring vertices (connected via yellow lines). Cluster consistency loss: We encourage points of the same cluster to have a similar flow. Multi-scan distance transform loss: To estimate the motion of points at time $t$, we rely not only on the points at $t+1$ but also on other close-by time points. Thus, even if matching points are missing (car mirror at $t+1$), the flow can be estimated correctly using $t-1$ or, more generally, $t \pm m$.
  • Figure 3: Argoverse 2 (2024) Scene Flow Challenge test set. Mean Dynamic Normalized EPE of Floxels compared to prior art. We report Floxels results for sequence lengths 5, 9, and 13. Supervised methods are shown with hatching. Floxels performs almost as well as EulerFlow, despite requiring only a fraction of the computational resources. We also report these results in a table (tab:bucket_val_full).
  • Figure 4: Per-class Dynamic Normalized EPE on Argoverse 2 (2024) Scene Flow Challenge test set. Supervised methods are shown with hatching. Bars are ordered from left to right by increasing mean Dynamic Normalized EPE.
  • Figure 5: Left: Bird's-eye view of the flow field. A truck passes behind a traffic light. The neural prior leads to a prediction of false flow in empty regions (windmill artifacts), and no flow is predicted for occluded regions. Windmill artifacts lie in regions without actual points and therefore do not contribute to the loss metrics, resulting in an overestimated performance for MLP-based methods. Right: Accumulated point clouds projected to the camera. Windmill artifacts of the MLP lead to points of the static car being falsely shifted to the left during LiDAR-to-camera synchronization. Floxels is not susceptible to this failure mode.
  • ...and 5 more figures
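
The Figure 2 caption describes two mechanisms that are easy to illustrate: reading a point's flow from a voxel grid by trilinear interpolation, and encouraging points of the same cluster to share a similar flow. The following is a minimal pure-Python sketch of both ideas, not the authors' implementation. The function names (`make_grid`, `interpolate_flow`, `cluster_consistency`), the `origin` and `voxel_size` parameters, and the choice of a mean squared deviation from the cluster mean as the penalty are all assumptions made for illustration; the paper's actual loss may differ.

```python
import math


def make_grid(nx, ny, nz):
    """Hypothetical helper: grid of per-vertex flow vectors, all zero flow."""
    return [[[[0.0, 0.0, 0.0] for _ in range(nz)]
             for _ in range(ny)]
            for _ in range(nx)]


def interpolate_flow(grid, point, origin=(0.0, 0.0, 0.0), voxel_size=1.0):
    """Trilinearly interpolate the flow at `point` from its 8 nearest vertices.

    Assumes at least 2 vertices along every axis. Points outside the grid
    are clamped to its boundary.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for axis in range(3):
        # Continuous grid coordinate, clamped to the valid vertex range.
        g = (point[axis] - origin[axis]) / voxel_size
        g = min(max(g, 0.0), dims[axis] - 1.0)
        i0 = min(int(math.floor(g)), dims[axis] - 2)
        base.append(i0)
        frac.append(g - i0)

    flow = [0.0, 0.0, 0.0]
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Weight of this corner: product of per-axis linear weights.
                w = ((frac[0] if dx else 1.0 - frac[0])
                     * (frac[1] if dy else 1.0 - frac[1])
                     * (frac[2] if dz else 1.0 - frac[2]))
                vertex = grid[base[0] + dx][base[1] + dy][base[2] + dz]
                for c in range(3):
                    flow[c] += w * vertex[c]
    return flow


def cluster_consistency(flows, labels):
    """Assumed penalty: mean squared deviation from each cluster's mean flow."""
    if not flows:
        return 0.0
    sums, counts = {}, {}
    for f, label in zip(flows, labels):
        s = sums.setdefault(label, [0.0, 0.0, 0.0])
        for c in range(3):
            s[c] += f[c]
        counts[label] = counts.get(label, 0) + 1
    total = 0.0
    for f, label in zip(flows, labels):
        for c in range(3):
            mean_c = sums[label][c] / counts[label]
            total += (f[c] - mean_c) ** 2
    return total / len(flows)


# Usage: a 2x2x2 grid where the vertices at x index 1 move along +x.
grid = make_grid(2, 2, 2)
for iy in range(2):
    for iz in range(2):
        grid[1][iy][iz] = [1.0, 0.0, 0.0]

points = [(0.25, 0.5, 0.5), (0.75, 0.5, 0.5)]
flows = [interpolate_flow(grid, p) for p in points]
print(flows)
print(cluster_consistency(flows, labels=[0, 0]))
```

In a test-time optimization setting, the vertex velocities would be the free parameters updated by gradient descent, which is why an automatic-differentiation framework would normally replace the plain lists used here.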