DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method

Qingwen Zhang, Xiaomeng Zhu, Yushan Zhang, Yixi Cai, Olov Andersson, Patric Jensfelt

TL;DR

DeltaFlow introduces a lightweight multi-frame scene flow method that uses a Temporal Δ Scheme to extract motion cues without expanding feature dimensionality as frames increase. It couples sparse voxel representations with a standard 3D backbone-decoder, and is guided by three losses (motion-awareness, category-balanced, instance-consistency) to address imbalance and object-level coherence. The approach achieves state-of-the-art results on Argoverse 2, Waymo, and nuScenes, with up to 22% lower dynamic error and up to 2x faster inference, while showing strong cross-domain generalization. The work provides open-source code and weights, highlighting its practical potential for real-time autonomous driving applications.

Abstract

Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow ($\Delta$Flow), a lightweight 3D framework that captures motion cues via a $\Delta$ scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2, Waymo and nuScenes datasets show that $\Delta$Flow achieves state-of-the-art performance with up to 22% lower error and $2\times$ faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability. The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights.

Paper Structure

This paper contains 40 sections, 11 equations, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison of multi-frame strategies for scene flow estimation. For clarity, voxelized features are shown in dense format. $X, Y, Z$ denote the spatial resolution, $C$ the number of feature channels, and $N$ the number of frames. Existing methods process voxelized representations by (a) concatenating features along the channel dimension [zhang2024deflow, fastflow3d] or (b) stacking features along an additional temporal dimension, as in 4D methods [kim2024flow4d, mambaflow]. Both increase the input size as $N$ grows. (c) Our proposed $\Delta$Flow applies a $\Delta$ scheme between voxelized frames, maintaining a compact feature representation whose size is constant and independent of $N$.
  • Figure 2: Overview of the proposed $\Delta$Flow architecture. The framework first extracts point-level features and voxelizes them to obtain sparse voxel features $\mathscr{D}$. The core temporal $\Delta$ scheme then computes the difference between the current frame $t$ and the previous frames $(t-1,\dots,t-N)$, weighted by a time-decay factor $\lambda$. The resulting $\Delta$ feature $\mathscr{D}_{\text{delta}}$ is passed to a 3D backbone–decoder network to estimate the final scene flow $\Delta\mathcal{F}_{t-1}$. This approach captures motion-specific cues efficiently while keeping the architecture compact and scalable, regardless of the number of frames.
  • Figure 2: Comparisons on the Waymo validation set, where each sequence contains around 200 frames. The upper group lists self-supervised methods; the lower group lists supervised methods.
  • Figure 3: Comparison of scene flow ground truth before and after motion compensation on a high-speed car. Blue points: LiDAR scan at $t+1$; Green points: LiDAR scan at $t$; Red lines: annotated flow vectors.
  • Figure 4: Scalability comparison of multi-frame scene flow estimation on the Argoverse 2 validation set. '#F' denotes the number of input frames processed. Flow4D and our $\Delta$Flow are evaluated across increasing frame counts, reporting relative training speed, memory usage, and bucket-normalized accuracy.
  • ...and 8 more figures
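
The temporal $\Delta$ scheme described in the Figure 2 caption can be illustrated with a minimal sketch. The captions only state that differences between the current frame and the previous frames are weighted by a time-decay factor $\lambda$; the exact weighting used below ($\lambda^{k-1}$ for frame $t-k$), the treatment of empty voxels as zero features, and all names (`temporal_delta`, `SparseGrid`, `decay`) are assumptions made for illustration, not the authors' implementation. Sparse voxel features are modeled as a plain dictionary from voxel index to a $C$-dimensional feature list.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]
SparseGrid = Dict[Voxel, List[float]]  # occupied voxel -> C-dim feature


def temporal_delta(current: SparseGrid, previous: List[SparseGrid],
                   channels: int, decay: float = 0.5) -> SparseGrid:
    """Decay-weighted sum of (current - past) over frames t-1 ... t-N.

    previous[0] is frame t-1 and previous[-1] is frame t-N.
    Empty voxels count as all-zero features (assumption of this sketch).
    """
    zero = [0.0] * channels
    delta: SparseGrid = {}
    for k, past in enumerate(previous, start=1):
        weight = decay ** (k - 1)  # hypothetical weighting: 1, decay, decay^2, ...
        # Visit every voxel occupied in either the current or the past frame.
        for voxel in set(current) | set(past):
            cur = current.get(voxel, zero)
            old = past.get(voxel, zero)
            acc = delta.setdefault(voxel, [0.0] * channels)
            for c in range(channels):
                acc[c] += weight * (cur[c] - old[c])
    return delta


# Toy example with C = 2 channels and N = 2 previous frames.
frame_t = {(0, 0, 0): [1.0, 2.0], (1, 0, 0): [3.0, 0.0]}
frame_t1 = {(0, 0, 0): [1.0, 2.0]}                         # frame t-1
frame_t2 = {(0, 0, 0): [1.0, 2.0], (2, 0, 0): [4.0, 4.0]}  # frame t-2

delta = temporal_delta(frame_t, [frame_t1, frame_t2], channels=2)

# The per-voxel feature width stays at C no matter how many frames are used,
# whereas channel-wise concatenation would need C * (N + 1) values per voxel.
widths = {len(feature) for feature in delta.values()}
```

In this toy example the voxel that is unchanged across all frames receives a zero $\Delta$ feature, while voxels that appear or disappear carry non-zero values, which is the kind of motion cue the caption refers to. Adding more previous frames changes only the number of accumulated terms, not the size of the result.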