ZeroFlow: Scalable Scene Flow via Distillation

Kyle Vedder; Neehar Peri; Nathaniel Chodosh; Ishan Khatri; Eric Eaton; Dinesh Jayaraman; Yang Liu; Deva Ramanan; James Hays

TL;DR

This work proposes Scene Flow via Distillation, a simple, scalable distillation framework that uses a label-free optimization method to produce pseudo-labels to supervise a feedforward model, and achieves state-of-the-art performance on the Argoverse 2 Self-Supervised Scene Flow Challenge while using zero human labels.
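The distillation loop described above can be illustrated with a toy sketch. It is not the authors' implementation: the paper's section titles name Neural Scene Flow Prior as the slow teacher and FastFlow3D as the fast student, whereas this sketch substitutes a hypothetical translation-only optimizer for the teacher and a hypothetical per-axis linear model for the student. All names here (`teacher_flow`, `Student`, `make_scene`) and the synthetic rigid scenes are assumptions made purely for illustration.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def teacher_flow(pc0, pc1, steps=60, lr=0.1):
    """Slow, label-free teacher (toy stand-in): gradient descent on a single
    translation t that pulls pc0 + t towards its nearest neighbours in pc1."""
    t = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for p in pc0:
            moved = tuple(p[i] + t[i] for i in range(3))
            nearest = min(pc1, key=lambda q: sum((moved[i] - q[i]) ** 2 for i in range(3)))
            for i in range(3):
                grad[i] += 2.0 * (moved[i] - nearest[i]) / len(pc0)
        t = [t[i] - lr * grad[i] for i in range(3)]
    return [tuple(t)] * len(pc0)  # one pseudo-label flow vector per point


class Student:
    """Fast feedforward student (toy stand-in): a per-axis linear map from the
    centroid displacement to the flow, evaluated in a single pass."""

    def __init__(self):
        self.w = [0.0, 0.0, 0.0]
        self.b = [0.0, 0.0, 0.0]

    def features(self, pc0, pc1):
        c0, c1 = centroid(pc0), centroid(pc1)
        return tuple(c1[i] - c0[i] for i in range(3))

    def predict(self, pc0, pc1):
        f = self.features(pc0, pc1)
        flow = tuple(self.w[i] * f[i] + self.b[i] for i in range(3))
        return [flow] * len(pc0)

    def fit(self, scenes, pseudo_labels, epochs=200, lr=0.3):
        # Supervised regression against the teacher's pseudo-labels only.
        for _ in range(epochs):
            for (pc0, pc1), labels in zip(scenes, pseudo_labels):
                f = self.features(pc0, pc1)
                target = labels[0]  # rigid toy scene: all points share one flow
                for i in range(3):
                    err = self.w[i] * f[i] + self.b[i] - target[i]
                    self.w[i] -= lr * err * f[i]
                    self.b[i] -= lr * err


def make_scene(shift):
    """Synthetic scene: a sparse 2x2x2 lattice moved by a rigid shift."""
    pc0 = [(10.0 * x, 10.0 * y, 10.0 * z)
           for x in range(2) for y in range(2) for z in range(2)]
    pc1 = [tuple(p[i] + shift[i] for i in range(3)) for p in pc0]
    return pc0, pc1


# 1) Unlabeled data, 2) teacher pseudo-labels, 3) student training.
train_shifts = [(1, 0, 0), (0, 1, 0), (0, 0, 1),
                (-1, 0.5, 0), (0.5, -1, 0.5), (0, 0.5, -1)]
scenes = [make_scene(s) for s in train_shifts]
pseudo_labels = [teacher_flow(pc0, pc1) for pc0, pc1 in scenes]
student = Student()
student.fit(scenes, pseudo_labels)

# At test time only the student's single forward pass is needed.
test_pc0, test_pc1 = make_scene((0.3, -0.4, 0.2))
student_flow = student.predict(test_pc0, test_pc1)[0]
print(student_flow)
```

The point of the sketch is the division of labour: the expensive optimization runs once per training scene to create pseudo-labels, and no ground-truth flow is ever passed to `fit`.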

Abstract

Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. State-of-the-art methods use strong priors and test-time optimization techniques, but require on the order of tens of seconds to process full-size point clouds, making them unusable as computer vision primitives for real-time applications such as open-world object detection. Feedforward methods are considerably faster, running on the order of tens to hundreds of milliseconds for full-size point clouds, but require expensive human supervision. To address both limitations, we propose Scene Flow via Distillation, a simple, scalable distillation framework that uses a label-free optimization method to produce pseudo-labels to supervise a feedforward model. Our instantiation of this framework, ZeroFlow, achieves state-of-the-art performance on the Argoverse 2 Self-Supervised Scene Flow Challenge while using zero human labels by simply training on large-scale, diverse unlabeled data. At test time, ZeroFlow is over 1000x faster than label-free state-of-the-art optimization-based methods on full-size point clouds (34 FPS vs 0.028 FPS) and over 1000x cheaper to train on unlabeled data compared to the cost of human annotation ($394 vs ~$750,000). To facilitate further research, we release our code, trained model weights, and high-quality pseudo-labels for the Argoverse 2 and Waymo Open datasets at https://vedder.io/zeroflow.html
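As a quick sanity check, the two "over 1000x" claims can be reproduced from the figures quoted in the abstract. The snippet below uses only those four numbers; the variable names are illustrative.

```python
# Figures quoted in the abstract.
student_fps = 34.0         # ZeroFlow
teacher_fps = 0.028        # label-free optimization-based method
pseudo_label_cost = 394.0  # USD, training on unlabeled data
human_label_cost = 750_000.0  # USD, approximate cost of human annotation

speedup = student_fps / teacher_fps                # roughly 1214x
cost_ratio = human_label_cost / pseudo_label_cost  # roughly 1904x

# Per-frame latency implied by the frame rates.
student_latency_ms = 1000.0 / student_fps  # roughly 29 ms
teacher_latency_s = 1.0 / teacher_fps      # roughly 36 s

print(f"speedup: {speedup:.0f}x, cost ratio: {cost_ratio:.0f}x")
print(f"latency: {student_latency_ms:.1f} ms vs {teacher_latency_s:.1f} s")
```

The implied latencies also agree with the abstract's wording: tens of seconds per frame for the optimization-based method and tens of milliseconds for the feedforward model.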

Paper Structure (24 sections, 4 equations, 10 figures, 9 tables, 1 algorithm)

Introduction
Background and Related Work
Method
Scaling Scene Flow via Distillation to Large Point Clouds
Neural Scene Flow Prior is a Slow Teacher
FastFlow3D is a Fast Student
Experiments
How does ZeroFlow perform compared to prior art on real point clouds?
How does ZeroFlow scale?
How does dataset diversity influence ZeroFlow's performance?
How do the noise characteristics of ZeroFlow compare to other methods?
How does teacher quality impact ZeroFlow's performance?
Conclusion
Argoverse 2 and Waymo Open Dataset Configuration Details
Exploring the importance of point weighting
...and 9 more sections

Figures (10)

Figure 1: We plot the error and run-time of recent scene flow methods on the Argoverse 2 Sensor dataset, along with the size of the point cloud prescribed in each method's evaluation protocol. Our method ZeroFlow 3X (ZeroFlow trained on 3$\times$ pseudo-labeled data) outperforms its teacher (NSFP) while running over 1000$\times$ faster, and ZeroFlow XL 3X (ZeroFlow with a larger backbone trained on 3$\times$ pseudo-labeled data) achieves state-of-the-art performance. Methods that use any human labels and zero-label methods are plotted with distinct markers.
Figure 2: Scene Flow vectors describe where the point on an object at time $t$ will end up on the object at $t+1$. In this example, ground truth flow vector A, associated with a point in the upper left concave corner of the object at $t$ has no nearby observations at $t+1$ due to occlusion of the concave feature. The ground truth flow vector B, associated with a point on the face of the object at $t$, does not directly match with any observed point on the object at $t+1$ due to observational sparsity. Thus, point matching between $t$ and $t+1$ alone is insufficient to generate ground truth flow.
Figure 3: The Scene Flow via Distillation (SFvD) framework, which describes a new class of scene flow methods that produce high quality, human label-free flow at the speed of feedforward networks.
Figure 4: Empirical scaling laws for ZeroFlow. We report Argoverse 2 validation split Threeway EPE as a percentage of the Argoverse 2 train split used, on a log$_{10}$-log$_{10}$ scale, trained to convergence. Threeway EPE performance of ZeroFlow scales logarithmically with the amount of training data.
Figure 5: Normalized frame bird's-eye-view heatmaps of endpoint residuals for Chamfer Distance, as well as the outputs for NSFP and Chodosh, on moving points (points with ground truth speed above 0.5 m/s). Perfect predictions would produce a single central dot. The top row shows the frequency on a $\log_{10}$ color scale; the bottom row shows the frequency on an absolute color scale. Qualitatively, methods with better quantitative results have tighter residual distributions. See the Supplemental for details.
...and 5 more figures
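The captions above refer to endpoint error (EPE) and to moving points selected by a 0.5 m/s speed threshold. A minimal sketch of both quantities follows, assuming flow vectors are per-frame displacements in metres and that consecutive sweeps are 0.1 s apart (a 10 Hz sensor; this interval is an assumption, not stated in the captions). The function names are hypothetical, and neither the Threeway EPE breakdown nor the frame normalization used in Figure 5 is reproduced here.

```python
import math


def endpoint_errors(pred_flow, gt_flow):
    """Per-point Euclidean distance between predicted and ground truth flow."""
    return [math.dist(p, g) for p, g in zip(pred_flow, gt_flow)]


def moving_mask(gt_flow, dt=0.1, speed_threshold=0.5):
    """True for points whose ground truth speed exceeds the threshold (m/s).
    dt is the assumed time between sweeps, in seconds."""
    return [math.hypot(*g) / dt > speed_threshold for g in gt_flow]


def mean_epe(pred_flow, gt_flow, mask=None):
    """Average endpoint error, optionally restricted to masked points."""
    errors = endpoint_errors(pred_flow, gt_flow)
    if mask is not None:
        errors = [e for e, keep in zip(errors, mask) if keep]
    return sum(errors) / len(errors)


# Tiny made-up example: two static points and one moving point.
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.1, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.2, 0.0)]

mask = moving_mask(gt)
print(mean_epe(pred, gt))        # averaged over all three points
print(mean_epe(pred, gt, mask))  # averaged over the moving point only
```

Restricting the average to moving points matters because static points usually dominate a scene, so an unmasked mean can hide large errors on the few points that actually move.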
