
UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model

Shuai Yuan, Lei Luo, Zhuo Hui, Can Pu, Xiaoyu Xiang, Rakesh Ranjan, Denis Demandolx

TL;DR

UnSAMFlow tackles occlusion and motion-boundary failures in unsupervised optical flow by integrating object-level cues from the Segment Anything Model (SAM). It introduces three SAM-based adaptations—semantic augmentation, a region-wise homography smoothness loss, and a mask feature module—to enforce object-consistent motion and robust feature aggregation, with optional SAM inputs at inference. The approach achieves state-of-the-art unsupervised results on KITTI and Sintel, demonstrates strong cross-domain generalization, and maintains efficient inference. The work highlights SAM's potential as a zero-shot, open-world semantic prior to guide low-level vision tasks like optical flow without requiring ground-truth labels.

Abstract

Traditional unsupervised optical flow methods are vulnerable to occlusions and motion boundaries due to lack of object-level information. Therefore, we propose UnSAMFlow, an unsupervised flow network that also leverages object information from the latest foundation model Segment Anything Model (SAM). We first include a self-supervised semantic augmentation module tailored to SAM masks. We also analyze the poor gradient landscapes of traditional smoothness losses and propose a new smoothness definition based on homography instead. A simple yet effective mask feature module has also been added to further aggregate features on the object level. With all these adaptations, our method produces clear optical flow estimation with sharp boundaries around objects, which outperforms state-of-the-art methods on both KITTI and Sintel datasets. Our method also generalizes well across domains and runs very efficiently.
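To make the idea of a homography-based smoothness term concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the plain least-squares (DLT) fit, and the L1 residual are all assumptions chosen for illustration, and the abstract does not specify how the homography is estimated. The sketch fits one homography to the flow inside a single segmentation mask and measures how far the flow deviates from it.

```python
import numpy as np


def fit_homography(src, dst):
    """Least-squares homography (DLT) mapping src -> dst; both are (N, 2), N >= 4.

    Assumes a non-degenerate fit (H[2, 2] != 0); no coordinate normalization.
    """
    x, y = src[:, 0], src[:, 1]
    u, v = dst[:, 0], dst[:, 1]
    zeros, ones = np.zeros_like(x), np.ones_like(x)
    # Each correspondence contributes two linear constraints on the 9 entries of H.
    rows_u = np.stack([-x, -y, -ones, zeros, zeros, zeros, u * x, u * y, u], axis=1)
    rows_v = np.stack([zeros, zeros, zeros, -x, -y, -ones, v * x, v * y, v], axis=1)
    A = np.concatenate([rows_u, rows_v], axis=0)
    # The right singular vector with the smallest singular value minimizes ||A h||.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]


def homography_residual(flow, mask):
    """Mean L1 gap between the flow and its best-fit homography inside one mask.

    flow: (H, W, 2) array, channel 0 = horizontal, channel 1 = vertical motion.
    mask: (H, W) boolean array for one object region (hypothetical SAM mask).
    """
    ys, xs = np.nonzero(mask)
    src = np.stack([xs, ys], axis=1).astype(np.float64)
    dst = src + flow[ys, xs]  # where each pixel lands in the second frame
    H = fit_homography(src, dst)
    proj = np.concatenate([src, np.ones((len(src), 1))], axis=1) @ H.T
    refined = proj[:, :2] / proj[:, 2:3]  # positions predicted by the homography
    return np.abs(refined - dst).sum(axis=1).mean()


# Usage: an affine flow field is a special case of a homography, so the
# residual is numerically zero; perturbing one pixel makes it positive.
yy, xx = np.mgrid[0:8, 0:8].astype(np.float64)
flow = np.stack([0.1 * xx + 1.0, -0.05 * yy + 0.5], axis=-1)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:7] = True
clean = homography_residual(flow, mask)

noisy_flow = flow.copy()
noisy_flow[3, 3] += 2.0
noisy = homography_residual(noisy_flow, mask)
```

In a training setting, a residual of this kind would be computed per region and summed; a robust estimator could replace the plain least-squares fit when a region contains outlier flow vectors.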

Paper Structure

This paper contains 61 sections, 3 equations, 17 figures, and 7 tables.

Figures (17)

  • Figure 1: Our UnSAMFlow utilizes object-level information from SAM to generate clear optical flow with sharp boundaries.
  • Figure 2: Our network structure. The red part highlights our mask feature adaptation ("+mf"), which is only applied in our second setting where the SAM masks, $M_1$ and $M_2$, are used as additional inputs to the network. See more detailed network structures in Appendix A.1.
  • Figure 3: Examples of object crops selected from KITTI and Sintel using SAM for semantic augmentation.
  • Figure 4: An example of why the traditional boundary-aware smoothness loss works poorly. Sample from Sintel final (ambush_5, frame #11). (a) Original image superimposed with SAM full segmentation; (b) Image patch; (c) Optical flow estimate from our baseline model superimposed with the SAM boundary (black); (d) Gradients of the traditional boundary-aware smoothness loss; (e) Gradients of our proposed homography smoothness loss; (f) Illustration of the poor landscape of the traditional smoothness loss. Note that for the gradients in (d) and (e), we use loss definitions based on the $L_2$ norm for better visualization. See the homography smoothness section and Appendix A.6 for explanations.
  • Figure 5: Our proposed mask feature module.
  • ...and 12 more figures