Zero-Shot Monocular Motion Segmentation in the Wild by Combining Deep Learning with Geometric Motion Model Fusion

Yuxiang Huang, Yuhao Chen, John Zelek

TL;DR

This work tackles monocular motion segmentation from a moving camera in a zero-shot setting by fusing deep-learning-derived object proposals with two complementary geometric motion models. The pipeline generates object proposals with foundation models and extracts object-specific point trajectories, optical flow, and monocular depth. It then fits an epipolar ($F$) motion model and a flow-depth motion model, builds one pairwise affinity matrix from each, and fuses the two matrices with co-regularized multi-view spectral clustering to produce dense motion segmentation without any training data. The approach achieves competitive results on the DAVIS-Moving, YTVOS-Moving, and KT3DInsMoSeg benchmarks, outperforming some supervised and unsupervised methods, and ablations show that it benefits significantly from model fusion. Limitations include slower inference due to the multiple foundation-model components and the need to assume a known number of motion groups. Future work targets additional geometric cues (e.g., the trifocal tensor) and potential end-to-end training to improve speed and scalability.

Abstract

Detecting and segmenting moving objects from a moving monocular camera is challenging in the presence of unknown camera motion, diverse object motions, and complex scene structures. Most existing methods rely on a single motion cue to perform motion segmentation, which is usually insufficient across different complex environments. While a few recent deep-learning-based methods are able to combine multiple motion cues to achieve improved accuracy, they depend heavily on vast datasets and extensive annotations, making them less adaptable to new scenarios. To address these limitations, we propose a novel monocular dense segmentation method that achieves state-of-the-art motion segmentation results in a zero-shot manner. The proposed method synergistically combines the strengths of deep learning and geometric model fusion methods by performing geometric model fusion on object proposals. Experiments show that our method achieves competitive results on several motion segmentation datasets and even surpasses some state-of-the-art supervised methods on certain benchmarks, while not being trained on any data. We also present an ablation study to show the effectiveness of combining different geometric models for motion segmentation, highlighting the value of our geometric model fusion strategy.

Paper Structure (18 sections, 2 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: Motion segmentation results from the proposed method using different motion cues on a scene with motion parallax and degeneracy. Motion cues used: (a) point trajectory; (b) optical flow; (c) optical flow + depth; (d) trajectory + optical flow + depth. Using a single motion cue is insufficient to correctly segment out the moving cyclist.
  • Figure 2: Our Motion Segmentation Pipeline. Our method can be summarized in three main steps: 1) Given a sequence of video frames, we produce an object proposal by automatically detecting, segmenting, and tracking common objects in the video. 2) We compute object-specific point trajectories, optical flow, and monocular depth maps for every frame. 3) We compute pairwise object motion similarity scores using two motion models (one based on point trajectories and the other based on optical flow and depth maps), and use them to construct two motion affinity matrices. The two matrices are fused using multi-view spectral clustering to cluster objects into different motion groups.
  • Figure 3: Qualitative results of different methods on the DAVIS-Moving (rows 1, 2), YTVOS-Moving (rows 3, 4), and extended KT3DMoSeg (rows 5, 6) datasets. MoSeg often mistakenly labels static objects as dynamic when there is degenerate camera motion. RigidMask fails to detect or coherently segment objects with non-rigid motions. Raptor has the same problems, although to a lesser extent overall. Our method, despite being zero-shot, performs well when facing these challenges.
  • Figure 4: Qualitative comparison of different motion models on different scenes. The pure optical-flow-based motion model (OC) suffers on scenes with objects at varying depths. Combining optical flow with depth information (OC + Depth) alleviates this problem only to some extent. The pure point-trajectory-based motion model (Trajs) suffers from motions near the epipolar plane and from inaccurate trajectory estimation. Motion model fusion solves these problems by combining the advantages of both motion models, and it outperforms any single model.
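
The fusion step named in the Figure 2 caption (two motion affinity matrices combined by multi-view spectral clustering) can be illustrated with a minimal NumPy sketch. It uses one common pairwise formulation of co-regularized spectral clustering and is not the authors' implementation: the function names, the weight `lam`, the farthest-point k-means, and the toy affinity values are all assumptions made for illustration. As in the paper, the number of motion groups `k` is assumed to be known. The toy affinities are block-constant, so the example shows only the mechanics of fusion, not the accuracy gains reported in the ablation.

```python
import numpy as np


def normalized_affinity(A):
    """Symmetrically normalize an affinity matrix: D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def top_eigenvectors(M, k):
    """Return the k eigenvectors of symmetric M with the largest eigenvalues."""
    _, vecs = np.linalg.eigh((M + M.T) / 2.0)  # eigh sorts eigenvalues ascending
    return vecs[:, -k:]


def kmeans(X, k, n_iter=50):
    """Small deterministic k-means with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def coreg_spectral_clustering(affinities, k, lam=0.5, n_iter=10):
    """Pairwise co-regularized spectral clustering over several affinity views."""
    L = [normalized_affinity(A) for A in affinities]
    U = [top_eigenvectors(Lv, k) for Lv in L]
    for _ in range(n_iter):
        for v in range(len(L)):
            # Pull view v toward the subspaces found by the other views.
            others = sum(U[w] @ U[w].T for w in range(len(L)) if w != v)
            U[v] = top_eigenvectors(L[v] + lam * others, k)
    # Stack the per-view embeddings, row-normalize, and cluster the objects.
    E = np.hstack(U)
    E = E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    return kmeans(E, k)


def toy_affinity(confused_pair, n_groups=3, size=2):
    """Hypothetical block affinity where one pair of groups looks partly similar."""
    B = np.full((n_groups, n_groups), 0.05)
    np.fill_diagonal(B, 1.0)
    i, j = confused_pair
    B[i, j] = B[j, i] = 0.6
    return np.kron(B, np.ones((size, size)))


# Six object proposals in three motion groups: {0, 1}, {2, 3}, {4, 5}.
A_traj = toy_affinity((1, 2))  # stand-in for the trajectory (epipolar) view
A_flow = toy_affinity((0, 1))  # stand-in for the flow-depth view
labels = coreg_spectral_clustering([A_traj, A_flow], k=3)
print(labels)
```

In this toy setup, objects 0 and 1, 2 and 3, and 4 and 5 end up in three separate clusters; the cluster ids themselves are arbitrary. In a real pipeline the two matrices would come from the pairwise motion similarity scores of the two motion models rather than from `toy_affinity`.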