Appearance-Based Refinement for Object-Centric Motion Segmentation
Junyu Xie, Weidi Xie, Andrew Zisserman
TL;DR
The paper tackles unsupervised discovery, segmentation, and tracking of independently moving objects in complex videos, where purely flow-based cues fail on stationary frames, articulated objects, and background motion. It introduces a two-stage appearance-based refinement: a sequence-level selector picks accurate flow-predicted masks as exemplars, and an object-centric, transformer-based mask corrector refines the remaining masks by exploiting temporal appearance consistency in DINO features. The model is pre-trained on synthetic data and adapted to real videos through self-supervised learning, so no human annotations are needed. It achieves competitive single-object results and significantly outperforms existing models on multi-object motion segmentation. The paper also investigates using the model's predictions as prompts for the per-frame Segment Anything Model (SAM) as a post-processing step.
Abstract
The goal of this paper is to discover, segment, and track independently moving objects in complex visual scenes. Previous approaches have explored the use of optical flow for motion segmentation, but their predictions are imperfect under partial motion, background distraction, and object articulations and interactions. To address this issue, we introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars, and an object-centric architecture that refines problematic masks based on exemplar information. The model is pre-trained on synthetic data and then adapted to real-world videos in a self-supervised manner, eliminating the need for human annotations. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTubeVOS, SegTrackv2, and FBMS-59. We achieve competitive performance on single-object segmentation, while significantly outperforming existing models on the more challenging problem of multi-object segmentation. Finally, we investigate the benefits of using our model's predictions as prompts for the per-frame Segment Anything Model.
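To make the idea of sequence-level exemplar selection concrete, the following is a minimal illustrative sketch, not the method of the paper, whose selector is not specified in this section. It assumes per-frame appearance features (for example, DINO-style feature maps) and flow-predicted binary masks are already available as arrays, and it uses a hand-written heuristic: frames whose masked appearance descriptor agrees most with the sequence-wide median descriptor are treated as exemplars. The function names `masked_mean_feature` and `select_exemplars` and the parameter `k` are hypothetical.

```python
import numpy as np


def masked_mean_feature(features, mask):
    """Average the per-pixel features inside a binary mask.

    features: (H, W, C) appearance features for one frame (assumed given).
    mask:     (H, W) binary flow-predicted mask for the same frame.
    """
    weights = mask.astype(np.float64)
    total = weights.sum()
    if total == 0:
        # Empty mask: return a zero descriptor, which will score lowest.
        return np.zeros(features.shape[-1])
    return (features * weights[..., None]).sum(axis=(0, 1)) / total


def select_exemplars(features, masks, k=2):
    """Pick k frames whose masked descriptor is most temporally consistent.

    Hypothetical heuristic, not the paper's selector: score each frame by
    the cosine similarity between its masked descriptor and the median
    descriptor over the whole sequence.

    features: (T, H, W, C), masks: (T, H, W).
    Returns (list of k frame indices, array of T scores).
    """
    descriptors = np.stack(
        [masked_mean_feature(f, m) for f, m in zip(features, masks)]
    )
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    unit = descriptors / np.maximum(norms, 1e-8)

    # The median is robust to a minority of frames with wrong masks.
    reference = np.median(unit, axis=0)
    reference = reference / max(np.linalg.norm(reference), 1e-8)

    scores = unit @ reference
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist(), scores
```

In this sketch, a frame whose flow mask has drifted onto the background produces a descriptor that disagrees with the rest of the sequence and is therefore not chosen as an exemplar; the remaining low-scoring masks would be the ones passed to a corrector.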
