Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering
Lin Xi, Weihai Chen, Xingming Wu, Zhong Liu, Zhengguo Li
TL;DR
This work addresses online unsupervised video object segmentation by leveraging contrastive motion clustering on optical flow. It jointly learns a compact embedding $Z$ and a set of non-learnable subspace prototypes $\mathcal{P}$ to perform online clustering into motion-based groups, aided by a boundary-prior contrastive loss that discriminates foreground from background. The method achieves state-of-the-art mean region similarity $\mathcal{J}$ among online methods on DAVIS$_{16}$, FBMS, and SegTrackV2, while delivering about $3\times$ faster inference than the previous best online approach thanks to an efficient Sinkhorn-based assignment and prototype-centroid updates. Overall, the approach provides an annotation-free solution for streaming video segmentation that can be optimized at frame, clip, or dataset scale.
Abstract
Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without using any further manual annotation. A major challenge is that the model has no access to the future and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm that takes optical flow as its input is proposed for online UVOS by exploiting the common fate principle: visual elements tend to be perceived as a group if they possess the same motion pattern. We build a simple and effective auto-encoder to iteratively summarize non-learnable prototypical bases for the motion pattern, while the bases in turn help learn the representation of the embedding network. Further, a contrastive learning strategy based on a boundary prior is developed to improve foreground and background feature discrimination in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, dataset) and performed in an online fashion. Experiments on the $\textit{DAVIS}_{\textit{16}}$, $\textit{FBMS}$, and $\textit{SegTrackV2}$ datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by a margin of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using online deep subspace clustering to tackle motion grouping, our method achieves higher accuracy with $3\times$ faster inference than the SoTA online UVOS method, striking a good trade-off between effectiveness and efficiency. Our code is available at https://github.com/xilin1991/ClusterNet.
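To make the interplay between embeddings and non-learnable prototypes more concrete, the following is a minimal pure-Python sketch of one common way to realize a Sinkhorn-based assignment followed by a prototype-centroid update, as mentioned in the TL;DR. It is not taken from the released code: the function names `sinkhorn` and `update_prototypes`, the temperature `epsilon`, the iteration count `n_iters`, and the assumption that scores are cosine similarities in $[-1, 1]$ are all illustrative choices.

```python
import math


def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Balanced soft assignment of N embeddings to K prototypes.

    scores: N x K list of similarities (assumed to lie in [-1, 1]).
    Returns an N x K matrix whose rows each sum to 1.
    All parameter names and defaults are hypothetical.
    """
    n, k = len(scores), len(scores[0])
    # Shift by the global maximum so the exponentials cannot overflow.
    m = max(max(row) for row in scores)
    q = [[math.exp((s - m) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: push every prototype towards a total mass of N / K.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / col_sums[j] * (n / k) for j in range(k)]
             for i in range(n)]
        # Row step: every embedding distributes a total mass of 1.
        q = [[v / sum(row) for v in row] for row in q]
    return q


def update_prototypes(embeddings, q):
    """Recompute each prototype as the assignment-weighted mean embedding.

    No gradient is involved, which is one way to read "non-learnable".
    """
    n, k, d = len(embeddings), len(q[0]), len(embeddings[0])
    prototypes = []
    for j in range(k):
        weight = sum(q[i][j] for i in range(n))
        prototypes.append(
            [sum(q[i][j] * embeddings[i][c] for i in range(n)) / weight
             for c in range(d)]
        )
    return prototypes


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

In a streaming setting, one would presumably compute `scores` between the current frame's embeddings and the prototypes, call `sinkhorn` to obtain soft motion-group assignments, and then refresh the prototypes with `update_prototypes` before the next frame arrives.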
