Zero-Shot Monocular Scene Flow Estimation in the Wild
Yiqing Liang, Abhishek Badki, Hang Su, James Tompkin, Orazio Gallo
TL;DR
This work tackles the generalization gap in monocular scene flow with a joint geometry–motion model that predicts 3D pointmaps and scene flow in a single feedforward pass. A large-scale, multi-domain data recipe spanning diverse indoor and outdoor synthetic scenes yields 1M annotated training samples. It is paired with scale-adaptive optimization, which aligns metric and relative data, and with a CSO (camera-space 3D offsets) parameterization of scene flow. The model achieves state-of-the-art 3D end-point error and generalizes zero-shot to unseen real-world videos and robotic manipulation scenes, which suggests practical applicability beyond autonomous driving. Overall, the method makes scene flow estimation more viable in the wild for AR, robotics, and related applications, and it underlines the value of 3D priors and of learning geometry and motion jointly.
Abstract
Large models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such general models exist for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We identify three key challenges and propose solutions for each. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and adopt a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on large-scale models in terms of 3D end-point error, and shows zero-shot generalization to casually captured videos from DAVIS and robotic manipulation scenes from RoboTAP. Overall, our approach makes scene flow prediction more practical in the wild.
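As an illustration of the metric named above, 3D end-point error is commonly computed as the mean Euclidean distance between predicted and ground-truth 3D flow vectors. The sketch below is a minimal, pure-Python version under that standard definition; it is not the authors' evaluation code, and details such as validity masks or scale alignment are omitted. The helper names `end_point_error` and `apply_offsets` are hypothetical. The second helper shows one reading of an offset-based flow representation, in which each 3D point is displaced by a per-point 3D offset to obtain its position at the next time step.

```python
import math


def end_point_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and the same length")
    total = 0.0
    for p, g in zip(pred_flow, gt_flow):
        total += math.dist(p, g)  # per-point L2 distance
    return total / len(pred_flow)


def apply_offsets(points, offsets):
    """Displace each 3D point by its 3D offset (hypothetical helper, illustration only)."""
    return [tuple(c + d for c, d in zip(pt, off)) for pt, off in zip(points, offsets)]


# Toy data: two points, one of which has a flow error of one unit along z.
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
epe = end_point_error(pred, gt)
print(epe)
```

Lower values are better; a value of 0 means every predicted flow vector matches the ground truth exactly.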
