SSF: Sparse Long-Range Scene Flow for Autonomous Driving
Ajinkya Khoche, Qingwen Zhang, Laura Pereira Sanchez, Aron Asefaw, Sina Sharif Mansouri, Patric Jensfelt
TL;DR
SSF targets long-range 3D scene flow for autonomous driving by replacing dense feature grids with a sparse convolutional backbone, which improves scalability with range. It introduces a sparse feature fusion mechanism that uses virtual voxels to align the size and ordering of sparse feature maps across time steps, enabling cross-frame feature aggregation, and a range-aware evaluation that stress-tests performance across distance. The method achieves state-of-the-art results on Argoverse2 with reduced memory and runtime, and remains robust as the perception range increases. This work establishes a practical baseline for long-range motion understanding and sets the stage for self-supervised extensions and integration into downstream perception tasks. Formally, the predicted flow is the ego-motion flow plus a predicted residual, $\hat{\mathcal{F}}_{t,t+1} = \mathcal{F}_{ego} + \Delta \hat{\mathcal{F}}_{t,t+1}$, and accuracy is measured by the end-point error over the points $\mathcal{P}_t$ of the scan at time $t$, $\mathrm{EPE}(\mathcal{P}_t) = \frac{1}{|\mathcal{P}_t|} \sum_{p \in \mathcal{P}_t} \| \hat{\mathcal{F}}_{t,t+1}(p) - \mathcal{F}^*_{t,t+1}(p) \|_2$, where $\mathcal{F}^*_{t,t+1}$ is the ground-truth flow.
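The two formulas in the TL;DR can be made concrete with a small NumPy sketch. It is illustrative only and is not taken from the SSF code base: the function names, the per-point `(N, 3)` array layout, the use of horizontal distance to the sensor, and the bin edges are all assumptions. The range-binned variant is one plausible reading of a range-aware evaluation, not necessarily the exact metric defined in the paper.

```python
import numpy as np

def compose_flow(ego_flow, residual_flow):
    """Total flow = ego-motion flow + predicted residual, per point (N, 3)."""
    return ego_flow + residual_flow

def epe(pred_flow, gt_flow):
    """Mean L2 distance between predicted and ground-truth flow vectors."""
    return float(np.linalg.norm(pred_flow - gt_flow, axis=1).mean())

def range_binned_epe(points, pred_flow, gt_flow, edges):
    """EPE per distance bin (hypothetical helper).

    Distance is the horizontal (x, y) range from the sensor origin, which is
    an assumption. Empty bins are skipped.
    """
    dist = np.linalg.norm(points[:, :2], axis=1)
    err = np.linalg.norm(pred_flow - gt_flow, axis=1)
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dist >= lo) & (dist < hi)
        if mask.any():
            out[(lo, hi)] = float(err[mask].mean())
    return out

# Two points: one near (10 m), one far (60 m).
points = np.array([[10.0, 0.0, 0.0], [60.0, 0.0, 0.0]])
ego = np.zeros((2, 3))
residual = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]])
gt = np.zeros((2, 3))

pred = compose_flow(ego, residual)
overall = epe(pred, gt)
per_bin = range_binned_epe(points, pred, gt, edges=[0, 35, 100])
```

Averaging the per-bin values instead of averaging over all points gives each distance band equal weight, so the comparatively few far-away points are not drowned out by the many nearby ones.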
Abstract
Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It is particularly important at long range, where object-based perception methods may fail because observations of distant objects are sparse. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow that adopts a sparse-convolution-based backbone for feature extraction. This approach introduces a new challenge: a mismatch in the size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our code will be released at https://github.com/KTH-RPL/SSF.git.
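The size and ordering mismatch described in the abstract, and the idea of filling missing locations with virtual voxels, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which operates on sparse convolution feature maps. The function name `fuse_sparse_features`, zero-valued features for virtual voxels, and channel-wise concatenation are all assumptions made for illustration. The sketch also assumes that voxel coordinates are unique within each frame.

```python
import numpy as np

def fuse_sparse_features(coords_a, feats_a, coords_b, feats_b):
    """Align two sparse feature maps on the union of their occupied voxels.

    coords_*: (N, 3) integer voxel indices, unique within each frame.
    feats_*:  (N, C) features for those voxels.
    Voxels occupied in only one frame get a zero-filled "virtual" entry in
    the other frame (zero filling is an assumption of this sketch).
    """
    all_coords = np.concatenate([coords_a, coords_b], axis=0)
    # union is sorted lexicographically; inverse maps each input row to its
    # row in union, which gives both frames a shared ordering.
    union, inverse = np.unique(all_coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions

    n_a = len(coords_a)
    c_a, c_b = feats_a.shape[1], feats_b.shape[1]
    dtype = np.result_type(feats_a, feats_b)
    fused = np.zeros((len(union), c_a + c_b), dtype=dtype)
    fused[inverse[:n_a], :c_a] = feats_a   # frame t features
    fused[inverse[n_a:], c_a:] = feats_b   # frame t+1 features
    return union, fused

# Frame t occupies voxels 0 and 1 along x; frame t+1 occupies voxels 1 and 2.
coords_t = np.array([[0, 0, 0], [1, 0, 0]])
feats_t = np.array([[1.0], [2.0]])
coords_t1 = np.array([[1, 0, 0], [2, 0, 0]])
feats_t1 = np.array([[10.0], [20.0]])

union, fused = fuse_sparse_features(coords_t, feats_t, coords_t1, feats_t1)
```

After fusion both frames share one coordinate list of equal length and order, so per-voxel features from the two time steps can be combined directly. The memory cost grows with the number of occupied voxels rather than with the full grid area.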
