Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering

Lin Xi, Weihai Chen, Xingming Wu, Zhong Liu, Zhengguo Li

TL;DR

This work addresses online unsupervised video object segmentation by leveraging contrastive motion clustering on optical flow. It jointly learns a compact embedding $Z$ and a set of non-learnable subspace prototypes $\mathcal{P}$ to perform online clustering into motion-based groups, aided by a boundary-prior contrastive loss that discriminates foreground from background. The method achieves state-of-the-art mean region similarity $\mathcal{J}$ on DAVIS$_{16}$, FBMS, and SegTrackV2, while delivering faster online inference than prior approaches thanks to an efficient Sinkhorn-based assignment and prototype-centroid updates. Overall, the approach provides an annotation-free, scalable solution for streaming video segmentation with robust performance across challenging dynamics and motion cues, suitable for real-time applications.
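To illustrate the Sinkhorn-based assignment mentioned above, the following is a minimal NumPy sketch of a generic Sinkhorn-Knopp balanced assignment, not the authors' implementation. The function name `sinkhorn_assign`, the temperature `eps`, and the iteration count `n_iters` are assumptions made for illustration; the summary does not specify them.

```python
import numpy as np


def sinkhorn_assign(scores, eps=0.05, n_iters=3):
    """Balanced soft assignment of n points to k prototypes (generic sketch).

    scores: (n, k) similarity matrix between embeddings and prototypes.
    Returns an (n, k) non-negative matrix whose rows each sum to 1.
    """
    n, k = scores.shape
    # Subtract the global maximum before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / eps)
    q /= q.sum()
    for _ in range(n_iters):
        # Normalize columns so every prototype receives total mass 1/k.
        q /= q.sum(axis=0, keepdims=True)
        q /= k
        # Normalize rows so every point carries total mass 1/n.
        q /= q.sum(axis=1, keepdims=True)
        q /= n
    # Rescale so that each row is a probability distribution over prototypes.
    return q * n
```

Taking the `argmax` over each row of the returned matrix would give one hard cluster label per pixel.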

Abstract

Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without using any further manual annotation. A major challenge is that the model has no access to the future and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm with optical flow as its input is proposed for online UVOS by exploiting the common fate principle that visual elements tend to be perceived as a group if they possess the same motion pattern. We build a simple and effective auto-encoder to iteratively summarize non-learnable prototypical bases for the motion pattern, while the bases in turn help learn the representation of the embedding network. Further, a contrastive learning strategy based on a boundary prior is developed to improve foreground and background feature discrimination in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, dataset) and performed in an online fashion. Experiments on the $\textit{DAVIS}_{\textit{16}}$, $\textit{FBMS}$, and $\textit{SegTrackV2}$ datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by a margin of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using online deep subspace clustering to tackle the motion grouping, our method achieves higher accuracy with $3\times$ faster inference than the SoTA online UVOS method, making a good trade-off between effectiveness and efficiency. Our code is available at https://github.com/xilin1991/ClusterNet.
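The alternation described above, summarizing prototypes from the embeddings and then grouping each pixel by its most similar prototype, could be sketched roughly as follows. This is a simplified spherical k-means style loop in NumPy under stated assumptions: the embeddings are given as an `(n, p)` array rather than produced by the auto-encoder, prototypes are updated as normalized centroids of their assigned embeddings, and the names `cluster_motion` and `l2_normalize` are hypothetical. The reconstruction and contrastive losses are omitted entirely.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere (zero vectors stay zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def cluster_motion(z, k=5, n_iters=10, seed=0):
    """Group embeddings by cosine similarity to k non-learnable prototypes.

    z: (n, p) array of per-pixel embeddings (assumed input format).
    Returns (labels of shape (n,), prototypes of shape (k, p)).
    """
    rng = np.random.default_rng(seed)
    z = l2_normalize(z)
    # Assumption: initialize prototypes from k randomly chosen embeddings.
    prototypes = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(n_iters):
        # Hard assignment: each embedding goes to its most similar prototype.
        labels = (z @ prototypes.T).argmax(axis=1)
        # Prototype update: normalized centroid of the assigned embeddings.
        for j in range(k):
            members = z[labels == j]
            if len(members):
                prototypes[j] = l2_normalize(members.mean(axis=0))
    labels = (z @ prototypes.T).argmax(axis=1)
    return labels, prototypes
```

Because the prototypes are computed from the data rather than trained by gradient descent, a loop of this kind can be rerun on each incoming frame, which is one way to read the "online" setting described in the abstract.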

Paper Structure (13 sections, 17 equations, 7 figures, 4 tables, 1 algorithm)

This paper contains 13 sections, 17 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Motion grouping. (a) The RGB image with ground truth; (b) the optical flow visualized by the inset color wheel; (c) the motion segmentation using our proposed prototypical subspace clustering framework ($k=5$ clusters). According to the prototypical subspace bases, we cluster distinct motion patterns (i.e., A-red, B-green, C-grey, D-cyan, and E-dark red). $\textit{sim}(\cdot)$ denotes the similarity between motion patterns and is normalized to $[0, 1]$.
  • Figure 2: Paradigm of the proposed clustering method. We iteratively summarize prototypical bases from the embedded representation $\bm{Z}$, and $\bm{Z}$ is then combined with the prototypes to compute the affinity $S$.
  • Figure 3: Overview of the optimization procedure for our proposed method, with optical flow as the input. Given an optical flow $\bm{X}$, we use an auto-encoder to embed it into a $p$-dimensional embedding feature $\bm{Z}$ and to output the corresponding reconstruction $\bm{\hat{X}}$. During the optimization phase, we iteratively summarize non-learnable prototypical bases for the motion pattern, while the bases are constrained by our proposed contrastive learning strategy to help shape the feature space. To obtain the final cluster labels, we use the proposed subspace clustering algorithm with a hard assignment to group each pixel to the prototypical bases.
  • Figure 4: Simplified diagram of the $(p-1)$-dimensional unit hypersphere, where each subspace corresponds to the surface region of the unit hypersphere centered on a different prototype, denoted as $\mathcal{P}_{j}$. When $\parallel\mathcal{P}_{i}^{\top}\mathcal{P}_{j}\parallel$ is sufficiently small for all $i\neq j$, the prototypes lie far apart from one another on the unit hypersphere, which makes it possible to identify a suitable boundary for clustering.
  • Figure 5: Visualization of the embedded representations $\bm{Z}$ with t-SNE on the bmx-trees sequence from the $\textit{DAVIS}_{\textit{16}}$ dataset. Note that the number of prototypes $k$ is set to 5 for each initialization condition, and we optimize our model for 10 iterations on each frame. A marker represents each prototype $\mathcal{P}_{j}$.
  • ...and 2 more figures
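
The separation condition in the Figure 4 caption can be checked numerically. The sketch below assumes the prototypes are stored as rows of a `(k, p)` array and uses the hypothetical helper name `prototype_coherence`; it returns the largest $|\mathcal{P}_{i}^{\top}\mathcal{P}_{j}|$ over all $i\neq j$ after normalizing each prototype to unit length.

```python
import numpy as np


def prototype_coherence(prototypes):
    """Largest absolute cosine similarity between two distinct prototypes.

    prototypes: (k, p) array with one non-zero prototype per row.
    A value near 0 means the prototypes are nearly orthogonal (well separated);
    a value near 1 means at least two prototypes are almost collinear.
    """
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    gram = np.abs(p @ p.T)
    # Ignore the diagonal, which holds each prototype's similarity to itself.
    np.fill_diagonal(gram, 0.0)
    return float(gram.max())
```

A small value suggests that a clear boundary between the clusters exists, in the sense described in the caption.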