V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

Abstract

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.
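The dense predictive loss described above can be illustrated with a small NumPy sketch. It assumes, following the Figure 4 caption, an L1 loss on masked-token predictions plus a distance-weighted L1 loss on context tokens. The exact weighting is not given in this summary, so the exponential decay with distance to the nearest masked token, the temperature `tau`, the coefficient `ctx_weight`, and the function name are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def dense_predictive_loss(pred, target, mask, positions, tau=2.0, ctx_weight=1.0):
    """Sketch of a masked L1 loss plus a distance-weighted context L1 loss.

    pred, target: (N, D) predicted and target features, one row per token.
    mask:         (N,) boolean array, True where the token was masked.
    positions:    (N, 3) token coordinates (t, h, w) on the patch grid.
    tau, ctx_weight: hypothetical hyperparameters (assumptions).
    """
    mask = np.asarray(mask, dtype=bool)
    if mask.all() or not mask.any():
        raise ValueError("need at least one masked and one context token")

    # Mean absolute error per token, averaged over the feature dimension.
    per_token_l1 = np.abs(pred - target).mean(axis=1)

    # (i) L1 loss on masked tokens.
    masked_loss = per_token_l1[mask].mean()

    # (ii) Context tokens, weighted by distance to the nearest masked token.
    # ASSUMPTION: weights decay as exp(-distance / tau); the paper's exact
    # weighting scheme may differ.
    ctx_pos = positions[~mask].astype(float)
    msk_pos = positions[mask].astype(float)
    pairwise = np.linalg.norm(ctx_pos[:, None, :] - msk_pos[None, :, :], axis=-1)
    nearest = pairwise.min(axis=1)
    weights = np.exp(-nearest / tau)
    ctx_loss = (weights * per_token_l1[~mask]).sum() / weights.sum()

    total = masked_loss + ctx_weight * ctx_loss
    return total, masked_loss, ctx_loss
```

Normalizing by the sum of the weights keeps the context term on the same scale as the masked term regardless of how many context tokens lie close to a masked region; this too is a design choice of the sketch rather than something stated in the source.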

Paper Structure (79 sections, 3 equations, 16 figures, 13 tables)

This paper contains 79 sections, 3 equations, 16 figures, 13 tables.

Figures (16)

  • Figure 1: V-JEPA 2.1 unlocks high-quality dense features. We compute PCA on patch features extracted from the same image or video and map the top three components to RGB channels for both V-JEPA 2 (ViT-g) and V-JEPA 2.1 (ViT-G). Our novel V-JEPA 2.1 produces dense representations with strong spatial and temporal consistency, learning semantically coherent features where similar objects map to the same PCA components.
  • Figure 1: Impact of individual components of our novel V-JEPA 2.1 training recipe. Results on image classification (IN1K), video classification (SSv2), depth estimation (NYU), and semantic segmentation (ADE20K). Introducing the Context Loss improves dense tasks but reduces classification performance on SSv2. Incorporating our Deep Self-Supervision restores the classification performance. The VisionMix 163M dataset, the Multi-Modal Tokenizer, and scaling model size further improve the results.
  • Figure 2: V-JEPA 2.1 ViT-G performance across dense and global prediction tasks. We show the relative improvements of V-JEPA 2.1 compared to the previous V-JEPA 2 ViT-g model [assran2025v]. We also report the performance of previous SOTA models using a frozen backbone evaluation: DINOv3 [simeoni2025dinov3] is the reference model for depth estimation, object tracking, semantic segmentation, SSv2 action recognition, and image classification; InternVideo2s [wang2024internvideo2] for K400 action recognition; STAformer [mur2024aff] for short-term object interaction anticipation; and PlausiVL [mittal2024can] for action anticipation. Tasks where V-JEPA 2.1 ViT-G obtains SOTA in frozen-backbone evaluation are underlined.
  • Figure 3: Influence of the context loss $\mathcal{L}_{ctx}$. We show PCA visualizations of the feature maps learned with V-JEPA 2 and with a model trained using V-JEPA 2 plus our $\mathcal{L}_{ctx}$ loss. While V-JEPA 2 features show only fragmented local spatial structure, explicitly supervising unmasked regions with $\mathcal{L}_{ctx}$ leads to feature maps that exhibit coherent spatial structure. Similar semantic parts (e.g., heads of dogs, wheels of cars) are mapped to the same PCA components.
  • Figure 4: V-JEPA 2.1 Detailed Architecture. Images and videos are processed by a 2D or a 3D convolutional patch embedding, respectively. Then, a 3D Rotary Position Embedding (RoPE) and a learnable modality embedding are added. The $x$-encoder processes the visible tokens and outputs multi-level embeddings formed by concatenating the normalized outputs of intermediate encoder blocks. An MLP then fuses this multi-level representation and reduces its dimensionality. These context tokens are concatenated with learnable mask tokens carrying spatio-temporal positional information. The predictor processes the combined sequence and produces multi-level predictions for the masked tokens. Training uses two losses: (i) an L1 loss on masked-token predictions (the original V-JEPA objective), and (ii) a distance-weighted L1 loss on nearby context tokens, both supervised using the $y$-encoder multi-level outputs.
  • ...and 11 more figures
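
The PCA visualizations referred to in the Figure 1 and Figure 3 captions (top three principal components of the patch features mapped to RGB channels) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `pca_to_rgb`, the per-channel min-max rescaling, and the assumption that the features arrive as a flat `(N, D)` array for a single frame are all hypothetical.

```python
import numpy as np


def pca_to_rgb(patch_features: np.ndarray, grid_shape: tuple[int, int]) -> np.ndarray:
    """Map (N, D) patch features to an (H, W, 3) image via their top-3 PCs.

    patch_features: one feature vector per patch, with N == H * W and D >= 3.
    grid_shape:     (H, W) layout of the patches on the image grid.
    """
    h, w = grid_shape
    if patch_features.shape[0] != h * w:
        raise ValueError("number of patches must equal H * W")

    # Center the features so the SVD yields principal directions.
    centered = patch_features - patch_features.mean(axis=0, keepdims=True)

    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T  # shape (N, 3)

    # ASSUMPTION: rescale each component independently to [0, 1] for display.
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    rgb = (projected - lo) / np.maximum(hi - lo, 1e-8)

    return rgb.reshape(h, w, 3)
```

For a video, one option is to fit the PCA on the patches of all frames jointly and then reshape per frame, so that the same color corresponds to the same component over time; whether the figures were produced this way is not stated in the captions.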