Fast Kernel Scene Flow

Xueqian Li, Simon Lucey

TL;DR

This work proposes a new positional encoding-based kernel that demonstrates state-of-the-art performance in efficient lidar scene flow estimation on large-scale point clouds, enabling a variety of practical applications in robotics and autonomous driving scenarios.

Abstract

In contrast to current state-of-the-art methods, such as NSFP [25], which employ deep implicit neural functions for modeling scene flow, we present a novel approach that utilizes classical kernel representations. This representation enables our approach to effectively handle dense lidar points while demonstrating exceptional computational efficiency -- compared to recent deep approaches -- achieved through the solution of a linear system. As a runtime optimization-based method, our model exhibits impressive generalizability across various out-of-distribution scenarios, achieving competitive performance on large-scale lidar datasets. We propose a new positional encoding-based kernel that demonstrates state-of-the-art performance in efficient lidar scene flow estimation on large-scale point clouds. An important highlight of our method is its near real-time performance (~150-170 ms) with dense lidar data (~8k-144k points), enabling a variety of practical applications in robotics and autonomous driving scenarios.

Paper Structure (23 sections, 25 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Current point cloud-based scene flow methods can be summarized by three key properties: (1) whether they use feed-forward learning or runtime optimization (x-axis); (2) whether they leverage point features (shaded versus empty legend markers); (3) whether they can run in real time or are better suited to offline applications (y-axis). Feed-forward learning methods, such as FLOT [puy20flot] and R3DSF [gojcic2021weakly], learn features from data during training. Typically, these methods are applied to sparse, small-scale datasets (8,192 points) and perform worse when tested on out-of-distribution data. In contrast, the runtime optimization-based method NSFP [li2021neural] is dominant in dense lidar flow estimation but suffers from extremely slow computation (8.38 s). FastNSF [li2023fast] addresses this issue and achieves speedups of up to 30 times (0.51 s). As a hybrid method, SCOOP [lang2023scoop] learns features for point correspondence but is still computationally inefficient (7.63 s). Our method integrates a per-point embedding-based feature within a kernel representation that solves a linear system, achieving near real-time performance (0.169 s) with an end-point error of 0.081 on dense lidar points.
  • Figure 2: Framework of PPE kernel scene flow. Given input point clouds $\mathcal{S}_1$ and $\mathcal{S}_2$, different approaches can be used to extract per-point features. For example: (a) Raw points, which treats the original points as point features; (b) PE, which uses an RFF-based positional encoding to embed the input points into high-frequency features; (c) PEAT, which extracts learned point features using positional encoding and self-attention. A kernel function $\mathcal{K}$ is then employed to compute the similarity matrix between the two inputs based on their point features. Finally, a linear coefficient vector $\bm{\alpha}$ is optimized per sample to predict the final flow. Our model is compact and fast, with $\bm{\alpha}$ being the only learnable parameter. The PPE features can be either pre-trained offline or computed as analytical positional encodings.
  • Figure 3: The red points indicate the raw point cloud, while the cyan points represent the supporting grid points. $\mathbf{p}$ and $\mathbf{p}^*$ are not necessarily in correspondence. For example, \ref{eq:rbf_kernel} describes a kernel function of size $N \,{\times}\, M$, where $N$ is the number of points in the source point cloud and $M$ is the number of grid points.
  • Figure 4: Visual results comparing our method with FastNSF [li2023fast] and SCOOP [lang2023scoop] on two examples from the Argoverse and Waymo Open scene flow datasets. The 3D scene flow is presented in a projected 2D view for clearer illustration. Zoom-in details are shown for the boxed areas. In the upper left corner, a color wheel indicates the projected flow magnitude (color intensity) and flow direction (angle). Our method handles complicated dynamic AV scenes well, capturing both the rigid pose and the dynamic motions. FastNSF struggles to capture multiple dynamic objects in some cases, leading to occasional noisy results. SCOOP, on the other hand, does not scale well to dense points, resulting in near-rigid estimates for most of the scenes.
  • Figure 5: Limitations of our method. In this example from Argoverse, while most predicted motions are smooth and accurate, the dynamic cars highlighted in the red box are inaccurately predicted as rigid, similar to the background motion. Additionally, the predicted background motion, marked by the red circle, exhibits extra noise.
  • ...and 3 more figures
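
The pipeline described in the Figure 2 and Figure 3 captions (per-point features, an $N \times M$ kernel between source points and grid points, and a linear coefficient $\bm{\alpha}$ that predicts flow) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`rff_encode`, `gaussian_kernel`, `fit_alpha`), the Gaussian form of the kernel, the frequency scale, the 3x3x3 grid, and the ridge regularizer are all assumptions made for illustration. In particular, the sketch fits $\bm{\alpha}$ against a synthetic known flow so that the linear solve is easy to check, whereas the method itself optimizes $\bm{\alpha}$ per sample at runtime, and its objective is not given in the captions above.

```python
import numpy as np

def rff_encode(points, freqs):
    """Random-Fourier-feature positional encoding (assumed form):
    maps (N, 3) points to (N, 2D) features [cos(2*pi*xB), sin(2*pi*xB)]."""
    proj = 2.0 * np.pi * points @ freqs
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

def gaussian_kernel(feat_a, feat_b, sigma):
    """Gaussian similarity between each row of feat_a (N) and feat_b (M),
    returning an (N, M) matrix. The Gaussian form is an assumption."""
    sq_dist = (
        (feat_a ** 2).sum(axis=1)[:, None]
        + (feat_b ** 2).sum(axis=1)[None, :]
        - 2.0 * feat_a @ feat_b.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def fit_alpha(K, target, lam=1e-3):
    """Ridge-regularized least squares: solve (K^T K + lam*I) alpha = K^T target."""
    num_grid = K.shape[1]
    return np.linalg.solve(K.T @ K + lam * np.eye(num_grid), K.T @ target)

rng = np.random.default_rng(0)
N, D = 200, 16

# Source point cloud (N points) and a 3x3x3 supporting grid (M = 27 points).
source = rng.uniform(-1.0, 1.0, size=(N, 3))
axis = np.linspace(-1.0, 1.0, 3)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)

# Shared random frequencies so both point sets live in the same feature space.
freqs = rng.normal(scale=0.2, size=(3, D))
K = gaussian_kernel(rff_encode(source, freqs), rff_encode(grid, freqs), sigma=1.0)

# Synthetic target: every point translates by the same vector (illustration only).
true_flow = np.tile(np.array([0.1, 0.0, -0.05]), (N, 1))

alpha = fit_alpha(K, true_flow)   # (M, 3) coefficients, one column per flow axis
pred_flow = K @ alpha             # (N, 3) predicted flow for the source points
```

Because the prediction is linear in $\bm{\alpha}$, the only per-sample unknown is an $M \times 3$ matrix, and the cost of the solve depends on the grid size $M$ rather than on the number of lidar points $N$. That is consistent with the compactness claimed in the Figure 2 caption, though the actual solver and loss may differ from this sketch.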