Moving Object Segmentation: All You Need Is SAM (and Flow)

Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman

TL;DR

This work investigates two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects.

Abstract

The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful, and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model achieves outstanding performance across multiple moving object segmentation benchmarks.
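
To make the flow-as-input idea concrete: a network built for RGB needs the two-channel flow field turned into a three-channel image. The abstract does not say which encoding is used, so the sketch below assumes the conventional HSV colour coding (hue for direction, saturation for magnitude). The function names `flow_to_rgb` and `flow_field_to_image` are hypothetical and are not taken from the paper or from SAM.

```python
import colorsys
import math


def flow_to_rgb(u, v, max_magnitude):
    """Map one flow vector (u, v) to an 8-bit RGB triple.

    Assumed encoding (not confirmed by the paper): hue encodes direction,
    saturation encodes magnitude clipped at max_magnitude, value is fixed
    at 1, so zero motion maps to white.
    """
    magnitude = math.hypot(u, v)
    angle = math.atan2(v, u)                    # radians in [-pi, pi]
    hue = ((angle + math.pi) / (2 * math.pi)) % 1.0
    if max_magnitude > 0:
        saturation = min(magnitude / max_magnitude, 1.0)
    else:
        saturation = 0.0                        # static field: all white
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))


def flow_field_to_image(flow):
    """Convert an H x W nested list of (u, v) tuples to H x W RGB triples.

    Magnitudes are normalised by the largest magnitude in the frame.
    """
    max_magnitude = max(math.hypot(u, v) for row in flow for (u, v) in row)
    return [[flow_to_rgb(u, v, max_magnitude) for (u, v) in row]
            for row in flow]


if __name__ == "__main__":
    # One static pixel next to one pixel moving right.
    print(flow_field_to_image([[(0.0, 0.0), (2.0, 0.0)]]))
```

Normalising per frame keeps slow-moving objects visible, at the cost of making colours incomparable across frames; a fixed normalisation constant would be the alternative.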

Paper Structure

This paper contains 25 sections, 6 equations, 13 figures, 16 tables, 2 algorithms.

Figures (13)

  • Figure 1: Adapting SAM for Video Object Segmentation by incorporating flow. (a) Flow-as-Input: FlowI-SAM takes in optical flow only and predicts frame-level segmentation masks. (b) Flow-as-Prompt: FlowP-SAM takes in RGB and applies flow information as a prompt for frame-level segmentation. (c) Sequence-level mask association: as a post-processing step, the multi-mask selection module autoregressively transforms frame-level mask outputs from FlowI-SAM and/or FlowP-SAM and produces sequence-level masks in which object identities are consistent throughout the sequence.
  • Figure 2: Overview of FlowI-SAM. (a) Inference pipeline of FlowI-SAM. (b) Architecture of FlowI-SAM with trainable parameters labelled. The point prompt token is generated by a frozen prompt encoder.
  • Figure 3: Overview of FlowP-SAM. (a) Inference pipeline of FlowP-SAM. (b) Architecture of FlowP-SAM. The flow prompt generator produces flow prompts to be injected into a SAM-like RGB-based segmentation module. Both modules take in the same point prompt token, which is obtained from a frozen prompt encoder. (c) Detailed architecture of the flow transformer. The input tokens function as queries within a lightweight transformer decoder, iteratively attending to dense flow features. The output moving object score (MOS) token is then processed by an MLP-based head to predict a score indicating whether the input point prompt corresponds to a moving object.
  • Figure 4: Qualitative comparison of flow-only segmentation methods on DAVIS (left), YTVOS (middle), and MoCA (right) sequences. Our FlowI-SAM (seq) successfully identifies moving objects against noisy optical flow backgrounds (e.g., the ducks in the fourth column).
  • Figure 5: Qualitative comparison of RGB-based segmentation methods on DAVIS (left), YTVOS (middle), and SegTrackv2 (right). While the previous method (third row) struggles to disentangle multiple moving objects (e.g., the intermingled goldfish in the second column), our FlowP-SAM + FlowI-SAM (seq) accurately separates and segments all moving objects.
  • ...and 8 more figures
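
The sequence-level association described in Figure 1(c) can be illustrated with a much simpler stand-in. The sketch below is not the paper's multi-mask selection module: it assumes masks are plain sets of `(row, col)` pixels and carries identities forward by greedy IoU matching against each object's most recent mask. The names `mask_iou` and `associate_masks` and the `iou_threshold` parameter are hypothetical.

```python
def mask_iou(a, b):
    """Intersection-over-union of two masks given as sets of (row, col)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def associate_masks(frames, iou_threshold=0.5):
    """Assign sequence-level identities to frame-level masks.

    frames: list over time; each entry is a list of masks (sets of pixels).
    Returns a list over time of dicts {object_id: mask}.
    Simplifying assumption: greedy IoU matching, processed frame by frame,
    so each frame's assignment depends on the earlier ones.
    """
    tracks = []
    last_seen = {}   # object_id -> most recent mask for that object
    next_id = 0
    for masks in frames:
        # Score every (known object, new mask) pair, best overlap first.
        candidates = sorted(
            ((mask_iou(prev, mask), obj_id, idx)
             for obj_id, prev in last_seen.items()
             for idx, mask in enumerate(masks)),
            key=lambda c: c[0],
            reverse=True,
        )
        assigned, used_ids, used_idx = {}, set(), set()
        for iou, obj_id, idx in candidates:
            if iou < iou_threshold:
                break                      # remaining pairs overlap too little
            if obj_id in used_ids or idx in used_idx:
                continue                   # each object and mask used once
            assigned[obj_id] = masks[idx]
            used_ids.add(obj_id)
            used_idx.add(idx)
        # Masks that match no known object start a new identity.
        for idx, mask in enumerate(masks):
            if idx not in used_idx:
                assigned[next_id] = mask
                next_id += 1
        last_seen.update(assigned)
        tracks.append(assigned)
    return tracks


if __name__ == "__main__":
    frame0 = [{(0, 0), (0, 1)}, {(5, 5), (5, 6)}]
    frame1 = [{(5, 5), (5, 6), (5, 7)}, {(0, 0), (0, 1)}]  # order swapped
    for t, objs in enumerate(associate_masks([frame0, frame1])):
        print(t, {k: sorted(v) for k, v in objs.items()})
```

Greedy matching is chosen here only for brevity; it ignores motion between frames, so fast-moving or heavily occluded objects would need something stronger, such as flow-based mask propagation before matching.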