
Self-Supervised Multi-Frame Neural Scene Flow

Dongrui Liu, Daqi Liu, Xueqian Li, Sihao Lin, Hongwei Xie, Bing Wang, Xiaojun Chang, Lei Chu

TL;DR

This work investigates why neural scene flow methods generalize well to large, open-world lidar data and derives a uniform-stability-based generalization bound for NSFP that tightens as the number of input points grows. Building on this theory, the authors propose a simple multi-frame scheme that jointly leverages forward and backward flows from three consecutive frames via a motion inverter and a temporal fusion module, and they provide a theoretical bound showing that this approach preserves generalization. The method achieves state-of-the-art results on Waymo Open and Argoverse without supervision; ablations demonstrate the necessity of each component, and case studies highlight robustness to fast motion. Overall, the paper offers both theoretical guarantees and a practical, effective multi-frame strategy for dense, real-world 3D scene flow estimation with large-scale point clouds.

Abstract

Neural Scene Flow Prior (NSFP) and Fast Neural Scene Flow (FNSF) have shown remarkable adaptability to large, out-of-distribution autonomous driving scenes. Despite their success, the underlying reasons for their strong generalization capabilities remain unclear. Our research addresses this gap by examining the generalization capabilities of NSFP through the lens of uniform stability, revealing that its generalization error is inversely proportional to the number of input points. This finding sheds light on NSFP's effectiveness in handling large-scale point cloud scene flow estimation tasks. Motivated by these theoretical insights, we further explore improving scene flow estimation by leveraging historical point clouds across multiple frames, which inherently increases the number of input points. Consequently, we propose a simple and effective method for multi-frame point cloud scene flow estimation, along with a theoretical evaluation of its generalization abilities. Our analysis confirms that the proposed method maintains a bounded generalization error, suggesting that adding multiple frames to the scene flow optimization process does not detract from its generalizability. Extensive experimental results on the large-scale Waymo Open and Argoverse autonomous driving lidar datasets demonstrate that the proposed method achieves state-of-the-art performance.
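
As background for the uniform-stability argument, the classical textbook statement (from Bousquet and Elisseeff's "Stability and Generalization", not the bound derived in this paper) may help show why a stability constant that shrinks with the sample count gives a shrinking generalization gap. The symbols $A_S$, $\ell$, $\beta$, $M$, $N$, and $\delta$ are introduced here for illustration and are assumptions; in particular, reading $N$ as the number of input points is our interpretation of the abstract. An algorithm $A$ is $\beta$-uniformly stable if removing any single sample $i$ from the training set $S$ changes the loss at any point $z$ by at most $\beta$; for a loss bounded by $M$, the second line then holds with probability at least $1-\delta$:

$$
\begin{aligned}
&\sup_{z}\,\bigl|\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)\bigr| \le \beta \quad \text{for all } S \text{ and } i,\\
&R(A_S) \le R_{\rm emp}(A_S) + 2\beta + \left(4N\beta + M\right)\sqrt{\frac{\ln(1/\delta)}{2N}}.
\end{aligned}
$$

If $\beta = c/N$ for some constant $c$, the gap becomes $2c/N + (4c + M)\sqrt{\ln(1/\delta)/(2N)}$, which vanishes as $N$ grows. This is the general mechanism behind a bound that improves with more input points.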

Paper Structure (17 sections, 3 theorems, 36 equations, 5 figures, 7 tables)

Key Result

Lemma 1

Bregman divergence is non-negative and additive. For example, given convex functions $F_1$, $F_2$ and $F=F_1+F_2$, for any $g,h\in \mathcal{H}$, we have $d_F(g,h) = d_{F_1}(g,h) + d_{F_2}(g,h)$ and $d_F(g,h) \geq 0$.
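
Both properties in Lemma 1 follow directly from the standard definition of the Bregman divergence. The short derivation below is a sketch under assumptions: the notation $d_F$ and the requirement that $F_1$ and $F_2$ be differentiable are ours and may differ from the paper's Definition 2.

$$
\begin{aligned}
d_F(g,h) &= F(g) - F(h) - \langle g - h,\, \nabla F(h) \rangle \\
&= \bigl[F_1(g) - F_1(h) - \langle g - h,\, \nabla F_1(h) \rangle\bigr] + \bigl[F_2(g) - F_2(h) - \langle g - h,\, \nabla F_2(h) \rangle\bigr] \\
&= d_{F_1}(g,h) + d_{F_2}(g,h), \\
d_{F_k}(g,h) &\ge 0 \quad \text{since convexity gives } F_k(g) \ge F_k(h) + \langle g - h,\, \nabla F_k(h) \rangle, \quad k \in \{1, 2\}.
\end{aligned}
$$

Additivity uses only the linearity of the gradient, $\nabla F = \nabla F_1 + \nabla F_2$; non-negativity is the first-order characterization of convexity. Together they also give $d_F(g,h) \ge d_{F_1}(g,h)$, a form that stability proofs commonly rely on.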

Figures (5)

  • Figure 1: Current learning-based point cloud scene flow methods [liu2019flownet3d, liu2019meteornet, wang2022matters, zhang2023gmsf, peng2023delflow] are trained on synthetic datasets and fail to generalize to realistic autonomous driving scenarios. Fortunately, FNSF [li2023fast] shows powerful generalization ability in large lidar autonomous driving scenes. However, none of these studies exploit the useful temporal information from previous point cloud frames. Extensive studies on optical flow estimation [wulff2017optical, golyanik2017multiframe, janai2018unsupervised, maurer2018proflow, liu2019selflow, stone2021smurf, hur2021self, mehl2023m] and panel (a) have shown that scene flows in consecutive frames are similar to each other (the upper-left color wheel represents the flow magnitude and direction). To this end, an intuitive approach for exploiting temporal information, namely Joint, is to force a single FNSF to jointly estimate the previous flow ($t\text{-}1 \,{\rightarrow}\, t$) and the current flow ($t \,{\rightarrow}\, t\text{+}1$). Panel (b) shows that such an intuitive multi-frame scheme achieves worse performance than two-frame FNSF on the Waymo Open dataset. In this paper, we are the first to propose a simple and effective multi-frame point cloud scene flow estimation scheme. Panel (c) shows that the proposed method achieves state-of-the-art performance on the Waymo Open dataset. For better visualization, different metrics are separately normalized. Please see the experiments section for more discussion of the evaluation metrics.
  • Figure 2: Overview of the proposed multi-frame point cloud scene flow estimation scheme. Given three consecutive frames ($t\text{-}1$, $t$, and $t\text{+}1$), we aim to estimate the scene flow from frame $t$ to frame $t\text{+}1$. Specifically, we use two models $g_f \left(\cdot \, ;\: \mathbf{\Theta_f} \right)$ and $g_b \left(\cdot \, ;\: \mathbf{\Theta_b} \right)$ to predict the forward scene flow $\mathcal{F}_2$ ($t \,{\rightarrow}\, t\text{+}1$) and the backward scene flow $\mathcal{B}_2$ ($t \,{\rightarrow}\, t\text{-}1$), respectively. Furthermore, a motion inverter $g_{\rm invert}$ and a temporal fusion model $g_{\rm fusion} \left(\cdot \, ; \mathbf{\Theta_{\rm fusion}} \right)$ are used to estimate the fused scene flow. The upper left color wheel in the fused scene flow represents the flow magnitude and direction.
  • Figure 3: Visual comparison between FNSF and the proposed method on the Argoverse dataset. For each point, color represents the normalized 3D end-point error $\mathcal{E}$, so blue indicates an accurate flow estimate. The detailed view shows two point clouds aligned by the estimated flow.
  • Figure 4: Fast motion cases on the Argoverse and the Waymo Open datasets. Color represents the normalized 3D end-point error $\mathcal{E}$ for each point, so blue indicates an accurate flow estimate.
  • Figure 5: The loss landscapes of FNSF and the proposed method on the Argoverse dataset. Color represents the testing loss. The proposed method eases the scene flow optimization process and reaches a flatter minimum.
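
The data flow described in the Figure 2 caption, and the per-point end-point error used to color Figures 3 and 4, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-layer MLPs, their sizes, and the random initialization are hypothetical stand-ins for $g_f$, $g_b$, and $g_{\rm fusion}$; the motion inverter is assumed to simply negate the backward flow; the fusion model is assumed to take the concatenated forward and inverted flows as input. No optimization loop or loss is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden, out_dim):
    """Randomly initialise a two-layer MLP (hypothetical stand-in for g_f, g_b, g_fusion)."""
    return {
        "w1": rng.normal(scale=0.1, size=(in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.1, size=(hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def mlp(params, x):
    """Apply the two-layer MLP with a ReLU non-linearity."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]

def motion_inverter(backward_flow):
    """Assumed form of g_invert: negate the t -> t-1 flow to approximate t -> t+1."""
    return -backward_flow

def fused_flow(points_t, theta_f, theta_b, theta_fusion):
    """Predict forward and backward flows at frame t, invert the latter, then fuse."""
    forward = mlp(theta_f, points_t)       # F_2: t -> t+1, shape (N, 3)
    backward = mlp(theta_b, points_t)      # B_2: t -> t-1, shape (N, 3)
    inverted = motion_inverter(backward)   # backward flow mapped to forward direction
    fusion_in = np.concatenate([forward, inverted], axis=1)  # shape (N, 6)
    return mlp(theta_fusion, fusion_in)    # fused flow, shape (N, 3)

def end_point_error(pred_flow, gt_flow):
    """Per-point 3D end-point error: Euclidean distance between flow vectors."""
    return np.linalg.norm(pred_flow - gt_flow, axis=1)

# Usage on synthetic points (a real pipeline would use lidar frames instead).
points_t = rng.uniform(-50.0, 50.0, size=(1024, 3))
theta_f = init_mlp(3, 64, 3)
theta_b = init_mlp(3, 64, 3)
theta_fusion = init_mlp(6, 64, 3)

flow = fused_flow(points_t, theta_f, theta_b, theta_fusion)
epe = end_point_error(flow, np.zeros_like(flow))
```

Negating the backward flow amounts to a locally constant-velocity assumption between frames $t\text{-}1$ and $t\text{+}1$; the fusion model can then learn how much to trust each estimate. The figures show a normalized error, which would require dividing `epe` by a scale chosen for visualization; that scale is not given in the captions and is left out here.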

Theorems & Definitions (9)

  • Definition 1
  • Definition 2
  • Lemma 1
  • Definition 3
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Remark 1