Video Representation Learning with Joint-Embedding Predictive Architectures

Katrina Drozdov, Ravid Shwartz-Ziv, Yann LeCun

TL;DR

The paper introduces Video JEPA with Variance-Covariance Regularization (VJ-VCR), a self-supervised method that predicts in hidden representation space to learn high-level video dynamics and uses variance-covariance regularization to prevent representation collapse. With optional latent variables, VJ-VCR can also model uncertainty about the future in non-deterministic settings. On MovingMNIST, CLEVRER, and CATER, its representations capture object dynamics better than those of pixel-space generative baselines, a finding supported by an analysis of the information content of the learned representations. The reported benefits are richer representations, reduced dimensional collapse, and improved downstream performance on tasks such as speed estimation and multi-label action recognition, along with computational efficiency because the objective is not pixel-focused. The latent-variable framework is positioned as a step toward scalable, interpretable, self-supervised video representation learning across diverse datasets.

Abstract

Video representation learning is an increasingly important topic in machine learning research. We present Video JEPA with Variance-Covariance Regularization (VJ-VCR): a joint-embedding predictive architecture for self-supervised video representation learning that employs variance and covariance regularization to avoid representation collapse. We show that hidden representations from our VJ-VCR contain abstract, high-level information about the input data. Specifically, they outperform representations obtained from a generative baseline on downstream tasks that require understanding of the underlying dynamics of moving objects in the videos. Additionally, we explore different ways to incorporate latent variables into the VJ-VCR framework that capture information about uncertainty in the future in non-deterministic settings.
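
The variance and covariance regularization mentioned above can be illustrated with a minimal sketch. The exact formulation used by VJ-VCR is not given in this summary, so the code below assumes the common VICReg-style definition: a hinge on the per-dimension standard deviation plus a penalty on off-diagonal covariance entries. The function names, the target standard deviation `gamma`, the stabilizer `eps`, the weights `alpha` and `beta`, and the choice to regularize the target representations are all assumptions made for illustration.

```python
import numpy as np


def variance_covariance_penalty(h, gamma=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of representations.

    h has shape (batch, dim). gamma and eps are hypothetical defaults taken
    from the usual VICReg-style formulation, not from the paper.
    """
    n, d = h.shape
    h = h - h.mean(axis=0, keepdims=True)        # center each dimension
    std = np.sqrt(h.var(axis=0, ddof=1) + eps)   # per-dimension std
    # Hinge: penalize dimensions whose std falls below gamma (collapse).
    variance_term = np.mean(np.maximum(0.0, gamma - std))
    cov = (h.T @ h) / (n - 1)                    # (dim, dim) covariance
    off_diag = cov - np.diag(np.diag(cov))
    # Penalize correlated dimensions (redundant information).
    covariance_term = np.sum(off_diag ** 2) / d
    return variance_term, covariance_term


def hidden_space_loss(pred_h_y, h_y, alpha=1.0, beta=1.0):
    """MSE in representation space plus weighted VC terms (assumed weights)."""
    mse = np.mean((pred_h_y - h_y) ** 2)
    variance_term, covariance_term = variance_covariance_penalty(h_y)
    return mse + alpha * variance_term + beta * covariance_term


rng = np.random.default_rng(0)
spread = 2.0 * rng.normal(size=(512, 8))   # well-spread representations
collapsed = np.ones((512, 8))              # every sample maps to one point
```

A collapsed encoder makes the prediction error trivially zero, so the variance term is what keeps `hidden_space_loss(collapsed, collapsed)` from being the minimum of the objective in this sketch.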

Paper Structure

This paper contains 39 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Models for self-supervised video representation learning. Inputs $x$ and targets $y$ denote input and target frames coming from the same video, respectively. The optional latent variable $z$ is intended to capture information about the targets $y$ not present in $x$. In the case of VJ-VCR, the Decoder module is optional. $D$ denotes the MSE loss function in the hidden representation space or in the input (pixel) space. $\mathrm{VC}$ denotes variance-covariance regularization.
  • Figure 2: Multi-label action recognition performed on the CATER dataset. The aggregated set of actions $a_y$ in the target frames is predicted from the inferred latent variable $z^{*}$ using a linear classifier. Measured by mAP on the validation set, latent variables $z^{*}$ computed from our pre-trained VJ-VCR model are more informative about the underlying actions than those from the pre-trained generative models. For reference, the performance of a linear classifier trained on top of randomly generated latent variables $z^{*}$ in this multi-label setting is $39.6\%$.
  • Figure 3: Reconstructions from our VJ-VCR model trained with and without a latent variable on the MovingMNIST dataset with a random switch in the digit trajectory after the third frame. The first three columns show the original target frames, the last three columns show the model's predictions for the target frames and the middle three columns show the overlap between the original and predicted frames (the latter are displayed in green). The model that does not incorporate a latent variable predicts all possible switches in trajectory of the digit, while the one that uses a latent variable can correctly identify the actual switch in digit trajectory.
  • Figure 4: Analysis of the informational content of the learned hidden representations of a VJ-VCR and a generative model through singular value decomposition.
  • Figure 5: Reconstructions from a generative model trained only with loss in pixel space (left) and a VJ-VCR model trained with loss in pixel space, loss in the hidden representation space, and variance-covariance regularization (right). Odd rows display 9 ground truth frames. Even rows display the first 3 ground truth frames which are the input to the model followed by the first 6 (out of 12) reconstructed frames. The model on the left has PSNR of 22.8 and the one on the right has PSNR of 21.2. Both models can predict the trajectories of the digits. Hidden representations from the VJ-VCR model can be used to predict the actual speed of the digits more accurately.
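
The singular value analysis referenced in Figure 4 can be sketched as follows. The procedure the authors actually used is not described in this summary, so this is one common way to inspect information content, assuming a matrix of hidden representations with one row per sample. The function names and the relative tolerance `rel_tol` are hypothetical.

```python
import numpy as np


def singular_value_spectrum(h):
    """Singular values (descending) of centered representations h: (batch, dim)."""
    h = h - h.mean(axis=0, keepdims=True)
    return np.linalg.svd(h, compute_uv=False)


def effective_rank(h, rel_tol=1e-3):
    """Count singular values above rel_tol times the largest one.

    rel_tol is an assumed threshold; a low count relative to dim suggests
    dimensional collapse.
    """
    s = singular_value_spectrum(h)
    if s[0] == 0.0:
        return 0  # fully collapsed: all samples share one representation
    return int(np.sum(s / s[0] > rel_tol))


rng = np.random.default_rng(0)
# Representations that use all 16 dimensions.
full = rng.normal(size=(256, 16))
# Representations confined to a 2-dimensional subspace of the 16 dimensions.
low = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 16))

rank_full = effective_rank(full)
rank_low = effective_rank(low)
```

Comparing the two spectra side by side, for example on a log scale, shows how quickly the singular values decay and therefore how many directions of the representation space carry information.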