Self-Supervised Multi-Frame Neural Scene Flow
Dongrui Liu, Daqi Liu, Xueqian Li, Sihao Lin, Hongwei Xie, Bing Wang, Xiaojun Chang, Lei Chu
TL;DR
This work investigates why neural scene flow methods generalize well to large, open-world lidar data. It derives a uniform-stability-based generalization bound for NSFP that tightens as the number of input point clouds grows. Building on this theory, the authors propose a simple multi-frame scheme that jointly leverages forward and backward flows from three consecutive frames through a motion inverter and a temporal fusion module. They also provide a theoretical bound showing that this approach preserves generalization. The method achieves state-of-the-art results on Waymo Open and Argoverse without supervision. Ablations show that each component is necessary, and case studies highlight robustness to fast motion. Overall, the paper offers both theoretical guarantees and a practical, effective multi-frame strategy for dense, real-world 3D scene flow estimation on large-scale point clouds.
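To make the three-frame idea concrete, here is a minimal sketch of how a backward flow could be turned into a forward-flow estimate and then combined with the directly estimated forward flow. It rests on two assumptions that are not from the paper.

First, it assumes constant velocity over the three frames, so the forward displacement from t to t+1 is taken to be the negated backward displacement from t to t-1. Second, it uses a fixed-weight average in place of temporal fusion. The names `invert_backward_flow`, `fuse_flows`, and `weight` are hypothetical. Both functions are hand-written stand-ins and are not the authors' motion inverter or temporal fusion module.

```python
import numpy as np


def invert_backward_flow(backward_flow: np.ndarray) -> np.ndarray:
    """Hypothetical constant-velocity stand-in for a motion inverter.

    backward_flow: (N, 3) per-point displacement from frame t to frame t-1.
    Under a constant-velocity assumption, each point's displacement from
    frame t to frame t+1 is the negation of its backward displacement.
    """
    return -np.asarray(backward_flow, dtype=float)


def fuse_flows(forward_flow: np.ndarray,
               inverted_flow: np.ndarray,
               weight: float = 0.5) -> np.ndarray:
    """Hypothetical fixed-weight stand-in for temporal fusion.

    Returns a per-point convex combination of the directly estimated
    forward flow (t -> t+1) and the forward flow inferred from the
    backward flow. Both inputs must have the same (N, 3) shape.
    """
    forward_flow = np.asarray(forward_flow, dtype=float)
    inverted_flow = np.asarray(inverted_flow, dtype=float)
    if forward_flow.shape != inverted_flow.shape:
        raise ValueError("flow arrays must have the same shape")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * forward_flow + (1.0 - weight) * inverted_flow
```

Consider forward flows `[[1, 0, 0], [0, 2, 0]]` and backward flows `[[-1, 0, 0], [0, -1, 0]]`. With the default equal weighting, the fused result is `[[1, 0, 0], [0, 1.5, 0]]`. The first point moves consistently in both directions, so its flow is unchanged. For the second point, the two estimates disagree, and the fused flow lands halfway between them. A learned fusion module could instead weight the two sources per point, for example to handle acceleration.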
Abstract
Neural Scene Flow Prior (NSFP) and Fast Neural Scene Flow (FNSF) have shown remarkable adaptability to large-scale, out-of-distribution autonomous driving scenarios. Despite their success, the reasons for their strong generalization remain unclear. Our research addresses this gap by examining the generalization of NSFP through the lens of uniform stability, revealing that its generalization error is inversely proportional to the number of input point clouds. This finding helps explain NSFP's effectiveness on large-scale point cloud scene flow estimation. Motivated by this theoretical insight, we further improve scene flow estimation by leveraging historical point clouds across multiple frames, which inherently increases the number of input point clouds. Accordingly, we propose a simple and effective method for multi-frame point cloud scene flow estimation, together with a theoretical analysis of its generalization ability. Our analysis shows that the proposed method retains a bounded generalization error, indicating that adding multiple frames to the scene flow optimization does not compromise its generalizability. Extensive experiments on the large-scale Waymo Open and Argoverse autonomous driving lidar datasets demonstrate that the proposed method achieves state-of-the-art performance.
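For orientation, the standard notion of uniform stability that this kind of analysis builds on can be written as follows. The notation is generic and is an assumption here, not the paper's own:

- $A$ is the learning algorithm and $S$ is a set of $N$ samples.
- $S^{(i)}$ is $S$ with its $i$-th sample replaced.
- $\ell$ is the loss.
- $R$ and $\hat R_S$ are the true and empirical risks.

The paper's specific constants and assumptions are not reproduced. The sketch only shows why a stability parameter that scales as $1/N$ yields a generalization gap that shrinks as more inputs are used.

```latex
\begin{aligned}
&A \text{ is } \beta\text{-uniformly stable if }
  \sup_{z}\,\bigl\lvert \ell\bigl(A(S), z\bigr) - \ell\bigl(A(S^{(i)}), z\bigr) \bigr\rvert \le \beta
  \quad \text{for all } S,\ i, \\
&\text{which implies }
  \Bigl\lvert \mathbb{E}_S\bigl[ R\bigl(A(S)\bigr) - \hat R_S\bigl(A(S)\bigr) \bigr] \Bigr\rvert \le \beta, \\
&\text{so if } \beta = \mathcal{O}(1/N), \text{ the expected generalization gap is } \mathcal{O}(1/N).
\end{aligned}
```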
