Zero-Shot Monocular Scene Flow Estimation in the Wild

Yiqing Liang, Abhishek Badki, Hang Su, James Tompkin, Orazio Gallo

TL;DR

This work tackles the generalization gap in monocular scene flow by proposing a joint geometry–motion model that predicts 3D pointmaps and scene flow in a single feedforward pass. A large-scale, multi-domain data recipe covering diverse indoor and outdoor content yields 1M annotated training samples and is paired with scale-adaptive optimization and a CSO (camera-space 3D offsets) parameterization to align metric and relative data. The approach achieves state-of-the-art 3D end-point error and generalizes zero-shot to unseen casually captured videos (DAVIS) and robotic manipulation scenes (RoboTAP), indicating applicability beyond autonomous driving. Overall, the method makes scene flow estimation more viable in the wild for AR, robotics, and related applications, while highlighting the continued value of 3D priors and integrated geometry–motion learning.

Abstract

Large models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such general models exist for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We identify three key challenges and propose solutions for each. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and adopt a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on large-scale models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, our approach makes scene flow prediction more practical in-the-wild.
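
The abstract reports results in terms of 3D end-point error. As a rough illustration, this metric is commonly computed as the mean Euclidean distance between predicted and ground-truth 3D flow vectors. The sketch below follows that common definition; the function name `end_point_error` and the list-of-tuples input format are assumptions for illustration, and the paper's exact evaluation protocol (masking, scale alignment) is not described in this section.

```python
import math


def end_point_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors.

    Both arguments are equal-length sequences of (x, y, z) tuples.
    This is the common textbook definition, not necessarily the
    exact protocol used in the paper.
    """
    if len(pred_flow) != len(gt_flow) or len(pred_flow) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for pred, gt in zip(pred_flow, gt_flow):
        total += math.dist(pred, gt)  # per-point 3D error
    return total / len(pred_flow)


# Two points: one is off by a 3-4-5 triangle (error 5), one is exact (error 0).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
print(end_point_error(pred, gt))  # -> 2.5
```

A lower value means the predicted 3D motion is closer to the ground truth on average.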

Paper Structure

This paper contains 44 sections, 11 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: Our model for monocular scene flow estimation predicts accurate pointmaps and 3D offsets for dynamic scenes from two input images (shown below each pane) captured by two cameras $C_1$ and $C_2$ at times $t_1$ and $t_2$, respectively. The animation shows an interpolation of the pointmaps from $C_1$ at $t_1$ to $C_2$ at $t_2$. These examples are from datasets not seen in training, showcasing the strong generalization abilities of our method.
  • Figure 2: Overview. Our method jointly predicts pointmaps ($X_1$, $X_2$) and scene flow $S$ with an information-sharing ViT backbone followed by three prediction heads ($\text{H}_{X_1}$, $\text{H}_{X_2}$, $\text{H}_S$).
  • Figure 3: Different datasets have different scales. Here we show a frame from MOVi-F [greff2021kubric], which is in relative scale, and one from Virtual KITTI [cabon2020vkitti2], which is in metric units. We need to account for this while training for both geometry and motion.
  • Figure 4: SF parameterizations. Given cameras $C_1$ and $C_2$ capturing a 3D point at two times, $t_1$ and $t_2$, EP expresses the displacement as the coordinates of $\textbf{x}(t_2)$ in $C_2$, $\Delta D$+OF as $d_2-d_1$ and optical flow $\mu$, and CSO (ours) as the 3D offset between the two points, in red in the diagram. Note: The faded blue point to the right has not been transformed from $C_1$ into $C_2$; it is visualized here with exactly the same numeric coordinates as it has in $C_1$.
  • Figure 5: Qualitative Results: Ours vs. Peers. We compare our method with the SF peers and one representative of the $\Delta D$+OF peers. We show qualitative results from all datasets. For each scene we show a pair of input images, and for each peer we show accuracy maps against Ours for both scene flow ($AccR$) and depth ($\delta_1$). The color legend is shown above.
  • ...and 4 more figures
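
The three scene flow parameterizations described in the Figure 4 caption can be illustrated for a single 3D point. The sketch below is one reading of that caption, not the authors' implementation: it assumes a simple pinhole camera (the focal length `f` and principal point `cx`, `cy` are hypothetical parameters), treats depth $d$ as the z-coordinate, and computes CSO by subtracting the point's coordinates in $C_1$ from its coordinates in $C_2$ without transforming between frames, as the caption's note suggests. The function names `project` and `parameterize` are made up for illustration.

```python
def project(point, f, cx, cy):
    """Pinhole projection of a camera-space 3D point to pixel coordinates."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def parameterize(p1_in_c1, p2_in_c2, f=1.0, cx=0.0, cy=0.0):
    """Express one point's motion in the three forms named in Figure 4.

    p1_in_c1: the point at time t1, in the coordinates of camera C1.
    p2_in_c2: the same point at time t2, in the coordinates of camera C2.
    """
    # EP: the coordinates of x(t2) in C2.
    ep = tuple(p2_in_c2)

    # Delta-D + OF: change in depth plus 2D optical flow between the projections.
    u1, v1 = project(p1_in_c1, f, cx, cy)
    u2, v2 = project(p2_in_c2, f, cx, cy)
    delta_d = p2_in_c2[2] - p1_in_c1[2]
    optical_flow = (u2 - u1, v2 - v1)

    # CSO: 3D offset between the two points, using the raw numeric coordinates
    # of each point in its own camera (assumption based on the caption's note).
    cso = tuple(b - a for a, b in zip(p1_in_c1, p2_in_c2))

    return {"EP": ep, "dD+OF": (delta_d, optical_flow), "CSO": cso}


result = parameterize((1.0, 0.0, 2.0), (2.0, 1.0, 4.0), f=2.0)
for name, value in result.items():
    print(name, value)
```

Under this reading, adding the CSO vector to the point's coordinates in $C_1$ recovers the EP coordinates, while the $\Delta D$+OF form additionally needs the camera intrinsics to recover a 3D position.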