Leveraging Motion Information for Better Self-Supervised Video Correspondence Learning

Zihan Zhou, Changrui Dai, Aibo Song, Xiaolin Fang

TL;DR

This work explores an efficient self-supervised Video Correspondence Learning framework (MER) that aims to accurately extract object details from unlabeled videos. It also introduces a flexible sampling strategy for inter-pixel correspondence information (Multi-Cluster Sampler) that lets the model pay more attention to the pixel changes of important moving objects.

Abstract

Self-supervised video correspondence learning depends on the ability to accurately associate pixels between video frames that correspond to the same visual object. However, achieving reliable pixel matching without supervision remains a major challenge. To address this issue, recent research has focused on feature learning techniques that aim to encode unique pixel representations for matching. Despite these advances, existing methods still struggle to achieve exact pixel correspondences and often suffer from false matches, limiting their effectiveness in self-supervised settings. To this end, we explore an efficient self-supervised Video Correspondence Learning framework (MER) that aims to accurately extract object details from unlabeled videos. First, we design a dedicated Motion Enhancement Engine that emphasizes capturing the dynamic motion of objects in videos. In addition, we introduce a flexible sampling strategy for inter-pixel correspondence information (Multi-Cluster Sampler) that enables the model to pay more attention to the pixel changes of important objects in motion. In experiments, our algorithm outperforms state-of-the-art competitors on video correspondence learning tasks such as video object segmentation and video object keypoint tracking.

Paper Structure

This paper contains 24 sections, 21 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Performance comparison on the DAVIS$_{17}$ val set. Our MER surpasses all existing self-supervised methods ($\mathcal{J} \& \mathcal{F}$ (Mean): 76.5) and is on par with many fully-supervised ones trained with massive annotations.
  • Figure 2: In the MER architecture, the current frame and its associated optical flow are jointly fed into the encoder as a self-concerned query frame. This enables multi-scale similarity sampling and reconstruction using key values stored in memory. During training, the original video frame is used as the self-supervised value. Once the encoder is trained, we transition to using instance masks as values.
  • Figure 3: Label reconstruction process. For two frames with dimensions $h \times w$, there is an affinity matrix $A \in \mathbb{R}^{hw \times hw}$ that represents the affinity between any two pixels. During unsupervised training, we use a randomly selected $\mathbf{Lab}$ channel as the label.
  • Figure 4: Visualization results of our Motion Enhancement Engine. These results demonstrate that our algorithm can highlight moving objects in frames (such as the motorcyclist, the foreground part within the forest area, etc.). See Section 3.2 (Motion Enhancement Engine) for details.
  • Figure 5: Our Value Extraction Network. Both $x_{target}$ and $x_{ref}$ are obtained through $\Phi(O)$.
  • ...and 6 more figures
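
The label reconstruction described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dot-product similarity, the row-wise softmax, the `temperature` value, and the function names `affinity_matrix` and `reconstruct_labels` are all assumptions chosen for illustration. Only the shape of $A \in \mathbb{R}^{hw \times hw}$ and the use of a single Lab channel as the label come from the caption.

```python
import numpy as np

def affinity_matrix(feat_target, feat_ref, temperature=0.07):
    """Build an (hw, hw) affinity matrix between two frames.

    feat_target, feat_ref: (h*w, c) arrays of per-pixel features.
    Row i holds the affinities of target pixel i to every reference pixel.
    Assumption: dot-product similarity followed by a row-wise softmax.
    """
    logits = feat_target @ feat_ref.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def reconstruct_labels(affinity, labels_ref):
    """Reconstruct target labels as an affinity-weighted sum of reference labels.

    affinity: (hw, hw), labels_ref: (hw, k). Returns an (hw, k) array.
    """
    return affinity @ labels_ref

# Toy example: two identical frames with random unit-norm features.
h, w, c = 4, 6, 16
rng = np.random.default_rng(0)
feat_ref = rng.standard_normal((h * w, c))
feat_ref /= np.linalg.norm(feat_ref, axis=1, keepdims=True)
feat_target = feat_ref.copy()

# Stand-in for one randomly selected Lab channel of the reference frame.
lab_ref = rng.random((h * w, 1))

A = affinity_matrix(feat_target, feat_ref)
recon = reconstruct_labels(A, lab_ref)
print(A.shape, recon.shape)
```

Because each row of `A` sums to one, every reconstructed value is a convex combination of reference labels. Under this sketch, swapping `lab_ref` for one-hot instance masks gives mask propagation with the same two functions, which mirrors the switch from frame values to instance masks mentioned in the Figure 2 caption.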