Toward Scalable, Flexible Scene Flow for Point Clouds

Kyle Vedder

TL;DR

This work advances scalable, flexible scene flow for point clouds by combining unsupervised distillation (ZeroFlow), a simple yet effective tracking-based evaluation (Bucket Normalized EPE and TrackFlow), and a novel Eulerian, ODE-based formulation (EulerFlow) that enables long-horizon motion modeling. ZeroFlow demonstrates that large-scale, pseudo-labeled data can outperform costly human labels while running in real time, highlighting data diversity and architecture choices as critical drivers. TrackFlow reveals that standard metrics obscure failures on small objects, prompting a more nuanced evaluation and a simple baseline that achieves strong performance on safety-critical categories. EulerFlow sets a new state-of-the-art in unsupervised scene flow by modeling motion as a continuous-time ODE over an entire sequence, enabling emergent 3D point tracking and broad domain applicability beyond autonomous vehicles. Collectively, these contributions pave the way for robust, scalable, and broadly applicable motion understanding in 3D scenes.

Abstract

Scene flow estimation is the task of describing 3D motion between temporally successive observations. This thesis aims to lay the foundation for scene flow estimators with two important properties: they are scalable, i.e. they improve with access to more data and computation, and they are flexible, i.e. they work out of the box in a variety of domains and on a variety of motion patterns without requiring significant hyperparameter tuning. In this dissertation we present several concrete contributions toward this goal. In Chapter 1 we contextualize scene flow and its prior methods. In Chapter 2 we present a blueprint to build and scale feedforward scene flow estimators without requiring expensive human annotations via large-scale distillation from pseudolabels provided by strong unsupervised test-time optimization methods. In Chapter 3 we introduce a benchmark that better measures estimate quality across diverse object types, bringing into focus what we care about and expect from scene flow estimators, and use this benchmark to host a public challenge that produced significant progress. In Chapter 4 we present a state-of-the-art unsupervised scene flow estimator that introduces a new, full-sequence problem formulation and exhibits great promise in adjacent domains like 3D point tracking. Finally, in Chapter 5 I philosophize about what's next for scene flow and its potential future broader impacts.

Paper Structure

This paper contains 95 sections, 21 equations, 51 figures, 14 tables, 1 algorithm.

Figures (51)

  • Figure 1: A deer leaping into the road, as imagined by DALL-E 3.
  • Figure 2: An example of optical flow on the Sintel synthetic dataset. The optical flow panel describes the image-space motion of the first input frame (at $t$) as it moves into the view of the second input frame (at $t+1$); the color of each pixel describes flow direction, while intensity describes its magnitude, with white being $\vec{0}$.
  • Figure 4: Visual definition of Endpoint Error (EPE), the workhorse of scene flow evaluation.
  • Figure 5: Scene flow is not just correspondence matching: scene flow vectors describe where the point on an object at time $t$ will end up on the object at $t+1$. We illustrate this with ground truth flow vectors A and B. Flow vector A, associated with a point in the upper left concave corner of the object at $t$, has no nearby observations at $t+1$ due to occlusion of the concave feature. Flow vector B, associated with a point on the face of the object at $t$, does not directly match with any observed point on the object at $t+1$ due to observational sparsity. Thus, point matching between $t$ and $t+1$ alone is insufficient to generate ground truth flow.
  • Figure 6: Zoox's Autonomous Vehicle features eight lidar sensors, two on each corner of the vehicle. Placing these sensors in a consistent global frame requires a representation that can readily handle arbitrary SE(3) transformations. Photo taken from https://zoox.com/about.
  • ...and 46 more figures
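
The caption for Figure 4 refers to Endpoint Error (EPE). As a minimal sketch, assuming the common definition of EPE as the Euclidean distance between a predicted flow vector and its ground truth flow vector, averaged over points, the metric could be computed as below. The function names `endpoint_errors` and `mean_endpoint_error` are hypothetical and are not taken from the thesis. This is the plain average only, not the Bucket Normalized EPE variant mentioned in the TL;DR.

```python
import math
from typing import List, Sequence

Vec3 = Sequence[float]


def endpoint_errors(pred_flow: Sequence[Vec3], gt_flow: Sequence[Vec3]) -> List[float]:
    """Per-point Euclidean distance between predicted and ground truth flow vectors.

    Assumption: both inputs list one 3D flow vector per point, in the same order.
    """
    if len(pred_flow) != len(gt_flow):
        raise ValueError("pred_flow and gt_flow must contain the same number of vectors")
    return [math.dist(p, g) for p, g in zip(pred_flow, gt_flow)]


def mean_endpoint_error(pred_flow: Sequence[Vec3], gt_flow: Sequence[Vec3]) -> float:
    """Average of the per-point endpoint errors (hypothetical helper name)."""
    errors = endpoint_errors(pred_flow, gt_flow)
    if not errors:
        raise ValueError("at least one flow vector is required")
    return sum(errors) / len(errors)


# Two points: the first prediction is off by 1 unit, the second by 5 units.
pred = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0)]
print(endpoint_errors(pred, gt))      # per-point errors
print(mean_endpoint_error(pred, gt))  # their average
```

Because this sketch weights every point equally, objects with many points dominate the average, which is consistent with the TL;DR's remark that standard metrics can obscure failures on small objects.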