TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation

Qingwen Zhang; Chenhan Jiang; Xiaomeng Zhu; Yunqi Miao; Yushan Zhang; Olov Andersson; Patric Jensfelt

TL;DR

TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames, establishing a new state-of-the-art for self-supervised feed-forward methods.

Abstract

Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals. In this paper, we present TeFlow, which enables multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames. Extensive evaluations demonstrate that TeFlow establishes a new state-of-the-art for self-supervised feed-forward methods, achieving performance gains of up to 33% on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods while running 150 times faster. The code is open-sourced at https://github.com/KTH-RPL/OpenSceneFlow along with trained model weights.
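The temporal ensembling idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `consensus_flow`, the cosine threshold `cos_threshold`, and the choice of vector magnitude as the reliability weight are all assumptions made for illustration. The sketch takes a pool of candidate motion vectors for one cluster, lets each candidate collect weighted votes from the candidates that point in a similar direction, and averages the winner with the candidates that agree with it.

```python
import math


def consensus_flow(candidates, cos_threshold=0.7):
    """Aggregate candidate motion vectors for one cluster (illustrative only).

    candidates: list of (x, y, z) displacement tuples.
    cos_threshold: hypothetical cutoff for "similar direction".
    Returns the weighted mean of the winning candidate and those agreeing with it.
    """
    norms = [math.sqrt(sum(c * c for c in v)) for v in candidates]

    def cosine(i, j):
        # Zero-length vectors have no direction, so they agree with nothing.
        if norms[i] == 0.0 or norms[j] == 0.0:
            return 0.0
        dot = sum(a * b for a, b in zip(candidates[i], candidates[j]))
        return dot / (norms[i] * norms[j])

    n = len(candidates)
    # M[i][j] = 1 when candidates i and j point in a similar direction
    # (assumed form of a directional-consistency test).
    M = [[1.0 if cosine(i, j) >= cos_threshold else 0.0 for j in range(n)]
         for i in range(n)]
    # Assumed reliability weight: the magnitude of each candidate.
    w = norms
    # A candidate's score is the total weight of the candidates that agree with it.
    scores = [sum(M[i][j] * w[j] for j in range(n)) for i in range(n)]
    winner = max(range(n), key=lambda i: scores[i])

    agree = [j for j in range(n) if M[winner][j] > 0.0]
    total = sum(w[j] for j in agree)
    if total == 0.0:
        return (0.0, 0.0, 0.0)
    # Weighted average over the winner and its supporters; outliers are excluded.
    return tuple(sum(w[j] * candidates[j][d] for j in agree) / total
                 for d in range(3))


pool = [(1.0, 0.0, 0.0), (1.2, 0.1, 0.0), (0.9, -0.1, 0.0), (-1.0, 0.5, 0.0)]
supervision = consensus_flow(pool)
```

In this toy pool the fourth candidate points roughly opposite to the other three, so it receives no support and does not contribute to the aggregated vector.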

Paper Structure (20 sections, 10 equations, 8 figures, 10 tables)

Introduction
Related Work
Preliminaries
Method: TeFlow
Temporal Ensembling for Dynamic Clusters
Training Objective
Implementation Details
Experiments
State-of-the-art Comparison
Ablation Studies
Qualitative results
Conclusion
Acknowledgement
Datasets Description
Additional Quantitative Analysis
...and 5 more sections

Figures (8)

Figure 1: (a) Multi-frame supervision maintains more stable guidance during occlusion by querying past frames, while two-frame supervision fails due to missing points. (b) Direction change of supervisory signals over time, reflecting their temporal consistency. The two-frame supervision of SeFlow (zhang2024seflow) exhibits abrupt variations with frequent direction shifts, while our five-frame TeFlow produces more stable signals that stay closer to the ground truth.
Figure 2: Accuracy vs. Runtime. Prior feed-forward methods are fast but less accurate, while optimization-based methods are accurate but too slow. TeFlow achieves both real-time speed and high accuracy.
Figure 3: An overview of TeFlow, a multi-frame feed-forward scene flow estimation pipeline, shown in the top row. Our self-supervised pipeline tackles the main challenge of deriving reliable supervision $\bar{\mathbf{f}}$ from dense multi-frame inputs. For each dynamic cluster $\mathcal{C}_j$, we construct a motion candidate pool (internal $\hat{\mathbf{f}}_{\mathcal{C}_j}$ and external $\mathbf{f}_{\mathcal{C}_j, k}$). Candidates are processed via a weighted consensus voting scheme that uses directional consistency $\mathbf{M}$ and magnitude-based reliability $\mathbf{w}$ to find a consensus winner (Eq. \ref{eq:voting}). The final supervision $\bar{\mathbf{f}}_{\mathcal{C}_j}$ is a weighted average of the winner and the agreeing candidates, which filters out inconsistent outliers (e.g., $\hat{\mathbf{f}}_{\mathcal{C}_j,1}$) for stable training.
Figure 4: Qualitative results on Argoverse 2 (left) and nuScenes (right). Rows show ground truth, SeFlow, and TeFlow predictions across time. Scene flow is visualized with hue indicating direction and saturation representing speed. Compared to SeFlow, TeFlow produces flow estimates that are more accurate and temporally consistent, particularly for dynamic objects (red circles).
Figure 5: Qualitative comparisons on the Argoverse 2 validation set. Left: A multi-vehicle scene. Right: A vehicle stopping for pedestrians. Our method robustly handles both scenarios, unlike the baseline. (Best viewed in color.) The scenes correspond to scene IDs 'c85a88a8-c916-30a7-923c-0c66bd3ebbd3' and 'b6500255-eba3-3f77-acfd-626c07aa8621'.
...and 3 more figures
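The color coding described in the Figure 4 caption (hue for direction, saturation for speed) can be sketched as follows. This is a generic HSV mapping, not the paper's plotting code: the function name `flow_to_rgb` and the saturation cap `max_speed` are hypothetical, and only the horizontal components of the flow are used.

```python
import colorsys
import math


def flow_to_rgb(vx, vy, max_speed=2.0):
    """Map a 2D flow vector to an RGB color (illustrative only).

    Hue encodes direction and saturation encodes speed.
    max_speed is a hypothetical value at which saturation reaches 1.
    """
    speed = math.hypot(vx, vy)
    # Wrap the angle into [0, 2*pi) and normalize it to a hue in [0, 1].
    hue = (math.atan2(vy, vx) % (2 * math.pi)) / (2 * math.pi)
    saturation = min(speed / max_speed, 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)


static_color = flow_to_rgb(0.0, 0.0)   # zero saturation: white
moving_color = flow_to_rgb(2.0, 0.0)   # full saturation along +x: red
```

With this mapping, static points fade to white and faster points appear more vivid, which makes dynamic objects stand out.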
