Appearance-Based Refinement for Object-Centric Motion Segmentation

Junyu Xie, Weidi Xie, Andrew Zisserman

TL;DR

The paper tackles unsupervised discovery, segmentation, and tracking of independently moving objects in complex videos, where pure flow-based cues struggle due to stationary frames, articulation, and background motion. It introduces a two-stage appearance-based refinement that combines a sequence-level exemplar mask selector with an object-centric mask corrector implemented as a transformer, leveraging temporal appearance consistency via DINO features. The approach is trained entirely on synthetic data and adapted to real videos through self-supervised learning, avoiding human annotations, and achieves competitive single-object results while significantly outperforming rivals on multi-object motion segmentation. It also demonstrates that SAM can complement the proposed method when used as a post-processing prompt, highlighting practical potential for improving video segmentation workflows in real-world applications.

Abstract

The goal of this paper is to discover, segment, and track independently moving objects in complex visual scenes. Previous approaches have explored the use of optical flow for motion segmentation, leading to imperfect predictions due to partial motion, background distraction, and object articulations and interactions. To address this issue, we introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars, and an object-centric architecture that refines problematic masks based on exemplar information. The model is pre-trained on synthetic data and then adapted to real-world videos in a self-supervised manner, eliminating the need for human annotations. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTubeVOS, SegTrackv2, and FBMS-59. We achieve competitive performance on single-object segmentation, while significantly outperforming existing models on the more challenging problem of multi-object segmentation. Finally, we investigate the benefits of using our model as a prompt for the per-frame Segment Anything Model.

Paper Structure

This paper contains 34 sections, 8 equations, 13 figures, and 20 tables.

Figures (13)

  • Figure 1: Overview of our multi-object motion segmentation method. The method starts with proposing object masks based on optical flow inputs (i.e., flow-based proposals), followed by an appearance-based refinement of flow-predicted masks. The latter stage relies on a selection-correction procedure, where high-quality exemplar masks are selected to guide the correction of other masks. Mask selection involves picking high-quality proposal masks based on both temporal coherence across mask shapes and semantic consistency with object appearances. Mask correction involves using the selected masks (the exemplars) to guide an appearance-based mask correction process in an object-centric model. The success of the method can be seen in the bottom row where correct segmentation masks are predicted, overcoming the deficiencies of the flow-predicted masks in Stage 1.
  • Figure 2: Appearance-based mask selection. Left: The mask selector takes in RGB frames, DINO features, and flow-predicted masks, generating three-channel error maps for each object (in this case the standing person). The error maps are used to select the exemplar masks for each object. The false positive (FP) channel highlights the over-segmented areas (the dog), while the false negative (FN) channel predicts the missing masks (the lower body of the person). The remaining channel encompasses a combination of true positive and true negative (TP + TN) regions, where the flow-predicted masks align accurately with the groundtruth, as indicated by white regions. The groundtruth annotation boundaries (purple contours) are provided for illustration purposes only. Right: Two potential cues underlying the mask selection process: (i) temporal consistency of mask shapes across consecutive frames, and (ii) the consistency of semantic content -- the mask (including all its parts) should only correspond to one semantic class.
  • Figure 3: Appearance-based mask corrector. The object-centric architecture first utilises the exemplar mask information to initialise independent object queries. The transformer decoder module then refines the query vectors with two cross-attention layers using dense target frame features for one, and exemplar information for the other. There are two major outputs: (i) the refined queries are spatially broadcast and projected to recover DINO features in target frames, and (ii) the cross-attention maps between queries and target frame DINO features are extracted, followed by a CNN-based upsampling module to yield final refined masks. Note, "Exemplar mask frames" refer to frames from which the exemplar masks are selected. For illustration, we present exemplar masks for different objects in the same frame. However, in practice, different objects may have distinct sets of exemplar mask frames.
  • Figure 4: Qualitative comparison on multi-object video segmentation benchmarks, including YTVOS18-m (left) and DAVIS17-m (right).
  • Figure 5: Qualitative comparison on single object video segmentation benchmarks, including DAVIS16 (left), SegTrackv2 (middle), and FBMS-59 (right).
  • ...and 8 more figures
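
The three-channel error map described in the Figure 2 caption can be illustrated with a small sketch. It assumes a reference mask is available, as it would be when building targets from synthetic data; in the paper the selector predicts these channels from RGB frames, DINO features, and flow-predicted masks. The function names and the lowest-error selection rule are hypothetical and only show how FP, FN, and TP + TN partition the pixels.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = object, 0 = background


def error_map(pred: Mask, ref: Mask) -> Tuple[Mask, Mask, Mask]:
    """Split pixels into (FP, FN, TP+TN) channels for one object in one frame."""
    pairs = [list(zip(p_row, r_row)) for p_row, r_row in zip(pred, ref)]
    fp = [[int(p == 1 and r == 0) for p, r in row] for row in pairs]  # over-segmented
    fn = [[int(p == 0 and r == 1) for p, r in row] for row in pairs]  # missing parts
    ok = [[int(p == r) for p, r in row] for row in pairs]             # TP + TN
    return fp, fn, ok


def error_fraction(fp: Mask, fn: Mask) -> float:
    """Fraction of pixels marked as either FP or FN."""
    total = sum(len(row) for row in fp)
    wrong = sum(map(sum, fp)) + sum(map(sum, fn))
    return wrong / total


def select_exemplars(maps: List[Tuple[Mask, Mask, Mask]], k: int = 1) -> List[int]:
    """Assumed rule: keep the k frames whose error maps flag the fewest pixels."""
    scores = [error_fraction(fp, fn) for fp, fn, _ in maps]
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


# Toy sequence: frame 0 has one FP and one FN pixel, frame 1 is exact.
ref = [[0, 1, 1], [0, 1, 1]]
preds = [[[1, 1, 1], [0, 1, 0]], [[0, 1, 1], [0, 1, 1]]]
maps = [error_map(p, ref) for p in preds]
exemplar_frames = select_exemplars(maps, k=1)
```

In this toy sequence the second frame has no flagged pixels, so it is the one kept as the exemplar. A real selector would also have to score frames without access to a reference mask, which is what the predicted error maps are for.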