MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

Adrien Bardes, Jean Ponce, Yann LeCun

TL;DR

MC-JEPA introduces a multi-task self-supervised framework that jointly learns optical flow and content features using a single shared encoder. By integrating a PWC-Net–like flow estimator with a VICReg-based content learner, and stabilizing training through variance-covariance regularization, the method delivers competitive flow performance while enhancing downstream segmentation and video understanding. Extensive ablations demonstrate the importance of data mixing, architectural choices, and training schedules for stable, cross-task generalization. The work advocates multi-task, joint-embedding architectures as a practical path toward versatile visual representations applicable to motion and content understanding.

Abstract

Self-supervised learning of visual representations has focused on learning content features, which do not capture object motion or location and instead serve to identify and differentiate objects in images and videos. Optical flow estimation, on the other hand, is a task that does not involve understanding the content of the images on which it is estimated. We unify the two approaches and introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach that jointly learns optical flow and content features within a shared encoder. We demonstrate that the two associated objectives, the optical flow estimation objective and the self-supervised learning objective, benefit from each other and thus learn content features that incorporate motion information. The proposed approach achieves performance on par with existing unsupervised methods on optical flow benchmarks, as well as with common self-supervised learning approaches on downstream tasks such as semantic segmentation of images and videos.

Paper Structure (15 sections, 7 equations, 9 figures, 8 tables)

This paper contains 15 sections, 7 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Multi-task self-supervised learning of content and motion features. MC-JEPA combines self-supervised feature learning and optical flow estimation in a multi-task setup with a single shared encoder. The self-supervised content-feature objective is trained on ImageNet, and the self-supervised flow estimation task is trained on various video datasets. The final encoder produces features that carry both motion and content information and that can be used to estimate optical flow in videos or for downstream content understanding tasks.
  • Figure 2: MC-JEPA architecture. Our method learns motion through optical flow estimation on videos and content through joint embedding of views of images, in a multi-task way with a shared encoder. Our optical flow estimation architecture is based on PWC-Net (Sun et al., 2018) and works as follows. Given a pair of consecutive frames $I_t$, $I_{t+1}$ in a video, an encoder produces a set of pyramidal features $\{X_t^{(l)}\}$ and $\{X_{t+1}^{(l)}\}$. The flow is estimated in a coarse-to-fine manner, starting at the lowest-resolution features $X^{(1)}$. A first flow $f_{t,t+1}^{(2)}$ is estimated by the flow estimator network, then used to warp the features $X_t^{(2)}$, which are compared to $X_{t+1}^{(2)}$ with a regression loss. The flow is then iteratively refined at every layer by predicting the residual flow and adding it to the flow from the previous layer. The final flow is used to warp $I_t$, and the warped image is compared with $I_{t+1}$ using a reconstruction loss. Forward-backward flow consistency is encouraged with a cycle consistency loss, which minimizes the distance between $X_t^{(l)}$ and $f_{t,t+1}^{(l)}(f_{t+1,t}^{(l)}(X_t^{(l)}))$ at every layer. When the encoder is trained in the multi-task setup with a standard self-supervised learning criterion, training is very unstable; this instability is prevented by a variance-covariance regularization term applied to every feature layer.
  • Figure 3: Qualitative visualization: optical flow. We compare the results of our complete model (MC-JEPA) and of our model pretrained only on flow (M-JEPA) with ARFlow. The top 2 rows are from KITTI-15; the bottom 2 rows are from Sintel clean and Sintel final.
  • Figure 4: Qualitative visualization: video segmentation. We visualize the segmentation maps obtained with the frozen features learned by MC-JEPA on the video instance tracking task on DAVIS 2017, for several video sequences, at frames $t = 1, 10, 25, 50$. Frame 1 is given as ground truth, and the others are predicted by our model.
  • Figure 5: (1) Ablation: flow start epoch. Flow estimation performance as a function of the ImageNet training epoch from which flow estimation starts. There are 100 pretraining epochs in total. (2) Ablation: cycle consistency coefficient. Flow estimation performance as a function of the coefficient used to balance the cycle consistency loss of Eq. (\ref{eq:cycle_loss}). (3) Ablation: multi-task balancing coefficient. Flow estimation and segmentation performance as a function of the balancing coefficient between the flow losses and the SSL loss in Eq. (\ref{eq:final_loss}).
  • ...and 4 more figures
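
The Figure 2 caption credits a variance-covariance regularization term with stabilizing multi-task training. As a rough illustration, the sketch below computes the two terms in the form used by VICReg: a hinge on the per-dimension standard deviation and a penalty on the squared off-diagonal entries of the covariance matrix. The function name, the default values of `gamma` and `eps`, and the assumption that each feature layer has already been flattened into `n` vectors of dimension `d` are illustrative choices, not details taken from this page.

```python
import math


def variance_covariance_penalty(features, gamma=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of feature vectors.

    features: list of n rows, each a list of d floats (n >= 2).
    gamma, eps: target standard deviation and numerical-stability constant
    (hypothetical defaults, chosen for illustration only).
    """
    n, d = len(features), len(features[0])

    # Center every dimension over the batch.
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]

    # Covariance matrix with the unbiased 1 / (n - 1) normalization.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Variance term: penalize dimensions whose standard deviation is below gamma.
    var_term = sum(
        max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)
    ) / d

    # Covariance term: penalize correlation between different dimensions.
    cov_term = sum(
        cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j
    ) / d

    return var_term, cov_term
```

With collapsed features (every row identical) the variance term is close to `gamma`, while well-spread, decorrelated features drive both terms to zero. In a multi-task setup like the one described in the caption, such a penalty would presumably be computed for each pyramid level and added to the flow and content losses with its own weighting.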