Zero-Shot Monocular Motion Segmentation in the Wild by Combining Deep Learning with Geometric Motion Model Fusion
Yuxiang Huang, Yuhao Chen, John Zelek
TL;DR
This work tackles monocular motion segmentation from a moving camera in a zero-shot setting by fusing deep-learning-derived object proposals with two complementary geometric motion models. The pipeline generates object proposals with foundation models and extracts object-specific point trajectories, optical flow, and monocular depth. It then fits an epipolar motion model (fundamental matrix $F$) and a flow-depth motion model, builds one pairwise affinity matrix from each, and fuses the two matrices with co-regularized multi-view spectral clustering to produce dense motion segmentation without any training data. The approach achieves competitive results on the DAVIS-Moving, YTVOS-Moving, and KT3DInsMoSeg benchmarks and outperforms some supervised and unsupervised methods; ablations show that it benefits significantly from model fusion. Limitations include slower inference, caused by the multiple foundation-model components, and the assumption that the number of motion groups is known. Future work targets additional geometric cues (e.g., the trifocal tensor) and potential end-to-end training to improve speed and scalability.
Abstract
Detecting and segmenting moving objects from a moving monocular camera is challenging in the presence of unknown camera motion, diverse object motions, and complex scene structures. Most existing methods rely on a single motion cue to perform motion segmentation, which is usually insufficient across different complex environments. While a few recent deep-learning-based methods combine multiple motion cues to achieve improved accuracy, they depend heavily on vast datasets and extensive annotations, making them less adaptable to new scenarios. To address these limitations, we propose a novel monocular dense segmentation method that achieves state-of-the-art motion segmentation results in a zero-shot manner. The proposed method synergistically combines the strengths of deep learning and geometric model fusion by performing geometric model fusion on object proposals. Experiments show that our method achieves competitive results on several motion segmentation datasets and even surpasses some state-of-the-art supervised methods on certain benchmarks, without being trained on any data. We also present an ablation study that shows the effectiveness of combining different geometric models for motion segmentation, highlighting the value of our geometric model fusion strategy.
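To make the fusion step concrete, the following is a minimal sketch of co-regularized two-view spectral clustering applied to two pairwise affinity matrices, one per geometric motion model. It uses the common pairwise co-regularization scheme, in which each view's spectral embedding is alternately re-estimated while being pulled toward the other view's embedding. This section does not specify the exact variant or hyperparameters used by the method, so the function names, the weight `lam`, the iteration count, the simple k-means routine, and the toy affinity values are all illustrative assumptions.

```python
import numpy as np


def normalized_affinity(K):
    """Symmetrically normalize an affinity matrix: D^{-1/2} K D^{-1/2}."""
    d = K.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return K * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def top_eigvecs(M, k):
    """Eigenvectors of a symmetric matrix for its k largest eigenvalues."""
    M = 0.5 * (M + M.T)           # guard against round-off asymmetry
    _, vecs = np.linalg.eigh(M)   # eigenvalues come back in ascending order
    return vecs[:, -k:]


def kmeans(X, k, n_iter=50):
    """Tiny deterministic k-means with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def coreg_spectral_clustering(K_epipolar, K_flow_depth, k, lam=0.5, n_iter=10):
    """Fuse two affinity matrices via pairwise co-regularization (sketch).

    k is the number of motion groups, assumed known; lam is a hypothetical
    co-regularization weight chosen only for illustration.
    """
    L1 = normalized_affinity(K_epipolar)
    L2 = normalized_affinity(K_flow_depth)
    U1, U2 = top_eigvecs(L1, k), top_eigvecs(L2, k)
    for _ in range(n_iter):
        # Each view's embedding is updated with the other view held fixed.
        U1 = top_eigvecs(L1 + lam * U2 @ U2.T, k)
        U2 = top_eigvecs(L2 + lam * U1 @ U1.T, k)
    U = np.hstack([U1, U2])
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return kmeans(U, k)


def block_affinity(n_per_group, within, cross):
    """Toy two-group affinity: `within` inside a group, `cross` between."""
    n = 2 * n_per_group
    K = np.full((n, n), cross)
    K[:n_per_group, :n_per_group] = within
    K[n_per_group:, n_per_group:] = within
    np.fill_diagonal(K, 1.0)
    return K


# Toy data: six object proposals forming two motion groups (made-up values).
K_epi = block_affinity(3, within=0.9, cross=0.1)
K_fd = block_affinity(3, within=0.8, cross=0.2)
K_fd[0, 2] = K_fd[2, 0] = 0.3   # one pair the second model is unsure about

labels = coreg_spectral_clustering(K_epi, K_fd, k=2)
print(labels)
```

In this toy setup the two views agree on the overall grouping but differ in confidence for one pair, and the co-regularization term lets the more confident view stabilize the other. In the actual pipeline the affinities would be derived from how well pairs of object proposals fit each motion model, rather than written by hand.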
