Dense Monocular Motion Segmentation Using Optical Flow and Pseudo Depth Map: A Zero-Shot Approach

Yuxiang Huang, Yuhao Chen, John Zelek

TL;DR

This work proposes a hybrid approach that combines the advantages of deep learning methods and traditional optical-flow-based methods to perform dense motion segmentation without requiring any training, closely matching state-of-the-art supervised methods.

Abstract

Motion segmentation from a single moving camera presents a significant challenge in the field of computer vision. This challenge is compounded by the unknown camera movements and the lack of depth information of the scene. While deep learning has shown impressive capabilities in addressing these issues, supervised models require extensive training on massive annotated datasets, and unsupervised models also require training on large volumes of unannotated data, presenting significant barriers for both. In contrast, traditional methods based on optical flow do not require training data; however, they often fail to capture object-level information, leading to over-segmentation or under-segmentation. In addition, they struggle in complex scenes with substantial depth variations and non-rigid motion, due to their over-reliance on optical flow. To overcome these challenges, we propose a hybrid approach that leverages the advantages of both deep learning methods and traditional optical-flow-based methods to perform dense motion segmentation without requiring any training. Our method begins by automatically generating object proposals for each frame using foundation models. These proposals are then clustered into distinct motion groups using both optical flow and relative depth maps as motion cues. The integration of depth maps derived from state-of-the-art monocular depth estimation models significantly enhances the motion cues provided by optical flow, particularly in handling motion parallax issues. Our method is evaluated on the DAVIS-Moving and YTVOS-Moving datasets, and the results demonstrate that our method outperforms the best unsupervised method and closely matches the state-of-the-art supervised methods.
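The clustering step described above can be sketched in a minimal form: build a pairwise motion-similarity matrix from object-level flow and depth cues, then spectrally partition it into motion groups. This is a simplified, hypothetical illustration, not the paper's implementation — the depth-normalized flow cue and the two-way Fiedler-vector split are assumptions standing in for the full similarity model and multi-way spectral clustering.

```python
import numpy as np

def motion_similarity(flows, depths, sigma=1.0):
    """Pairwise similarity between object-level motion cues.

    flows:  (N, 2) mean optical-flow vector per object proposal
    depths: (N,)   mean relative depth per object proposal
    Dividing flow by depth is a rough stand-in for the paper's
    parallax handling (hypothetical simplification).
    """
    cues = flows / depths[:, None]  # depth-normalized flow
    d2 = ((cues[:, None, :] - cues[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def spectral_bipartition(W):
    """Split objects into two motion groups using the Fiedler
    vector of the symmetric normalized graph Laplacian."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues ascending
    fiedler = vecs[:, 1]                 # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# Toy example: objects 0-1 share one motion, objects 2-3 another;
# the two pairs land in different motion groups.
flows = np.array([[5.0, 0.1], [4.8, -0.1], [0.2, 0.0], [0.1, 0.1]])
depths = np.array([1.0, 1.1, 1.0, 0.9])
labels = spectral_bipartition(motion_similarity(flows, depths))
```

A real pipeline would cluster into an unknown number of motion groups (e.g. by inspecting the Laplacian eigengap) rather than assuming a two-way split.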

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Motion segmentation results of the proposed method using only optical flow vs. using both optical flow and relative depth as the motion cue. (a) is a frame from a video sequence. (b) is the object proposal generated for this input frame. (c) and (d) are the optical flow mask and the relative monocular depth map generated by off-the-shelf deep learning models. (e) and (f) show the motion segmentation results of our method using only optical flow and using optical flow plus relative depth, respectively. In this case, optical flow alone is insufficient to segment the moving object due to motion parallax as well as forward motion.
  • Figure 2: Diagram of our proposed motion segmentation method. Given an image sequence, the proposal extraction module automatically identifies, segments, and tracks all common objects through the whole sequence to generate an object proposal for each frame. Meanwhile, the motion cue generation module generates optical flow masks and monocular depth maps using PWC-Net and DINOv2. Object-specific optical flow and depth maps are obtained by combining object proposals with optical flow and monocular depth maps. Given these object-specific motion cues, pairwise object similarity scores are computed to construct the motion similarity matrix. Finally, spectral clustering is used to cluster each object into its motion group.
  • Figure 3: Qualitative comparison with the state-of-the-art unsupervised dense motion segmentation method meunier_em-driven_2023 on the DAVIS-Moving (columns 1-3) and YTVOS-Moving (columns 4-6) datasets. First row: original video frame. Second row: motion segmentation results produced by meunier_em-driven_2023. Third row: motion segmentation results of our method. Last row: ground truth.
  • Figure 4: Qualitative comparison with state-of-the-art methods on DAVIS-Moving (rows 1-2) and YTVOS-Moving (rows 3-4). MoSeg performs best as a supervised method. RigidMask struggles with non-rigid motions, and Raptor has similar issues but to a lesser extent. Our method matches the performance of the supervised method in these challenging scenarios.
  • Figure 5: Qualitative ablation study: qualitative comparison between motion segmentation results using optical flow alone (OC) and both optical flow and the depth map (OC + Depth). The purely optical-flow-based motion model (OC) suffers when multiple objects are at different depths. Combining optical flow with depth (OC + Depth) significantly mitigates this problem.
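The "object-specific motion cue" step named in the Figure 2 caption — combining an object proposal with the dense optical flow and depth maps — can be illustrated as simple masked pooling. This is a hedged sketch under the assumption that per-object cues are mean flow and mean depth inside the proposal mask; the paper's actual cue construction may differ.

```python
import numpy as np

def object_cues(mask, flow, depth):
    """Pool dense motion cues inside one object proposal.

    mask:  (H, W) boolean object proposal
    flow:  (H, W, 2) dense optical flow field
    depth: (H, W) relative monocular depth map
    Returns the proposal's mean flow vector and mean depth -
    a simplified stand-in for the paper's object-specific cues.
    """
    return flow[mask].mean(axis=0), depth[mask].mean()

# Toy 4x4 frame with one 2x2 proposal moving right by 3 px.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
flow = np.zeros((4, 4, 2))
flow[1:3, 1:3] = [3.0, 0.0]
depth = np.full((4, 4), 2.0)
mean_flow, mean_depth = object_cues(mask, flow, depth)
```

These per-object (flow, depth) summaries are exactly the inputs a pairwise similarity function would consume before spectral clustering.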