SSF: Sparse Long-Range Scene Flow for Autonomous Driving

Ajinkya Khoche, Qingwen Zhang, Laura Pereira Sanchez, Aron Asefaw, Sina Sharif Mansouri, Patric Jensfelt

TL;DR

SSF targets long-range 3D scene flow for autonomous driving by replacing dense feature grids with a sparse convolutional backbone, which improves scalability. It introduces a sparse feature fusion mechanism that uses virtual voxels to keep the feature maps of the two time steps aligned in size and ordering, enabling cross-frame feature aggregation, and a range-aware evaluation that stress-tests performance across distance. The method achieves state-of-the-art results on Argoverse2 with reduced memory and runtime, and remains robust as the perception range increases. This work establishes a practical baseline for long-range motion understanding and sets the stage for self-supervised extensions and integration into downstream perception tasks. The core objective is formalized as $\hat{\mathcal{F}}_{t,t+1} = \mathcal{F}_{ego} + \Delta \hat{\mathcal{F}}_{t,t+1}$, the ego-motion flow plus a predicted residual, and the evaluation as the end-point error $\mathrm{EPE}(\mathcal{P}_t) = \frac{1}{|\mathcal{P}_t|} \sum_{p \in \mathcal{P}_t} \| \hat{\mathcal{F}}_{t,t+1}(p) - \mathcal{F}^*_{t,t+1}(p) \|_2$, where $\mathcal{F}^*_{t,t+1}$ is the ground-truth flow.
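The two formulas above can be made concrete with a minimal pure-Python sketch. It is an illustration only, not the authors' implementation: the function names `compose_flow` and `epe` and the toy flow values are assumptions introduced here, and flows are represented as plain lists of `(x, y, z)` tuples rather than tensors.

```python
import math

def compose_flow(ego_flow, residual_flow):
    """Per-point sum F_ego + delta F_hat; each flow is a list of (x, y, z) tuples."""
    return [tuple(e + r for e, r in zip(ef, rf))
            for ef, rf in zip(ego_flow, residual_flow)]

def epe(pred_flow, gt_flow):
    """Mean L2 distance between predicted and ground-truth flow vectors."""
    errors = [math.dist(p, g) for p, g in zip(pred_flow, gt_flow)]
    return sum(errors) / len(errors)

# Toy data (hypothetical): two points, both static in the ground truth.
ego = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
residual = [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0)]   # second residual is wrong by 5 m
ground_truth = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

predicted = compose_flow(ego, residual)
print(epe(predicted, ground_truth))  # prints 2.5
```

The first point is predicted exactly and the second is off by a vector of length 5, so the mean error over the two points is 2.5.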

Abstract

Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance at long range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow that adopts a sparse-convolution-based backbone for feature extraction. This approach introduces a new challenge: a mismatch in size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our code will be released at https://github.com/KTH-RPL/SSF.git.
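The quadratic-scaling claim can be checked with a short sketch. The helper names `dense_cell_count` and `occupied_pillars`, the 0.5 m pillar size, and the 50 m and 100 m ranges are all hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math

def dense_cell_count(half_range_m, voxel_size_m):
    """Cells in a dense square bird's-eye-view grid centered on the ego vehicle."""
    side = round(2 * half_range_m / voxel_size_m)
    return side * side

def occupied_pillars(points, voxel_size_m):
    """Set of (ix, iy) pillar indices that contain at least one point."""
    return {(math.floor(x / voxel_size_m), math.floor(y / voxel_size_m))
            for x, y, *_ in points}

# Doubling the range quadruples the dense grid, regardless of the point count.
near = dense_cell_count(50.0, 0.5)    # 200 x 200 cells
far = dense_cell_count(100.0, 0.5)    # 400 x 400 cells
print(near, far, far // near)         # prints 40000 160000 4

# A sparse representation only stores pillars that actually contain points.
scan = [(0.1, 0.1, 0.0), (0.2, 0.3, 1.0), (10.0, -3.2, 0.5)]
print(len(occupied_pillars(scan, 0.5)))  # prints 2
```

The dense count depends only on range and resolution, while the sparse count is bounded by the number of points in the scan, which is the motivation for the sparse backbone.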

Paper Structure

This paper contains 16 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Mean dynamic normalized EPE (Khatri et al., 2024) plotted against inference frame rate on the validation set of the Argoverse2 sensor dataset. The markers $\blacktriangle$ and $\times$ denote the size of the square grid, centered on the ego vehicle, that is used for feature extraction and training of the neural network. As the grid size increases, the inference memory of dense-grid methods increases, as indicated by the size of the translucent circles around the markers. Our method, SSF, demonstrates state-of-the-art performance while maintaining low memory and runtime.
  • Figure 2: Schematic of our SSF model. The network takes as input point clouds at times $t$ and $t+1$, shown in blue and green respectively. The first step involves voxelizing the space into pillars and computing sparse voxel feature encoding (VFE) features $E^s_t$ and $E^s_{t+1}$. Here, $M_t$ and $M_{t+1}$ denote the masks of voxels occupied by the point clouds at times $t$ and $t+1$. To facilitate fusion by concatenation, the VFE feature maps are augmented with virtual voxels at locations defined by the set differences $M_{t+1} \setminus M_t$ and $M_t \setminus M_{t+1}$. The concatenated sparse feature maps are then processed using a sparse U-Net autoencoder. Finally, a linear decoder combines the encoder's output with sparse feature maps and point-wise offsets to generate per-point scene flow, represented by the arrows in the rightmost image.
  • Figure 3: Plots of inference memory and runtime against perception range.
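The virtual-voxel fusion described in the Figure 2 caption can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: sparse feature maps are modeled as dictionaries from hypothetical `(ix, iy)` pillar indices to feature lists, virtual voxels are assumed to carry zero features, and the names `augment` and `fuse` are introduced here.

```python
def augment(feats, missing_coords, dim):
    """Add virtual voxels (assumed zero-valued) at coordinates the map lacks."""
    out = dict(feats)
    for coord in missing_coords:
        out[coord] = [0.0] * dim
    return out

def fuse(feats_t, feats_t1, dim):
    """Align two sparse maps with virtual voxels, then concatenate per voxel."""
    mask_t, mask_t1 = set(feats_t), set(feats_t1)
    aug_t = augment(feats_t, mask_t1 - mask_t, dim)    # M_{t+1} \ M_t
    aug_t1 = augment(feats_t1, mask_t - mask_t1, dim)  # M_t \ M_{t+1}
    coords = sorted(aug_t)  # both maps now cover the same voxels, in one order
    return coords, [aug_t[c] + aug_t1[c] for c in coords]

# Toy feature maps with 2 channels each; only voxel (1, 0) is occupied in both.
E_t = {(0, 0): [1.0, 2.0], (1, 0): [3.0, 4.0]}
E_t1 = {(1, 0): [5.0, 6.0], (2, 1): [7.0, 8.0]}

coords, fused = fuse(E_t, E_t1, dim=2)
for coord, feature in zip(coords, fused):
    print(coord, feature)
```

After augmentation both maps have the same number of voxels in the same order, so channel-wise concatenation is well defined, which is the size and ordering mismatch the fusion scheme is meant to resolve.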