DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

Xiangteng He, Shunsuke Sakai, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal

TL;DR

DSeq-JEPA addresses the limitation of uniform region treatment in image-based JEPA by introducing a discriminative sequential learning mechanism. It selects Top-$N$ informative regions via a saliency map and predicts their embeddings in a GPT-style sequence, forming a semantically meaningful curriculum from primary to secondary cues. Across ImageNet, FGVC, detection/segmentation, and low-level reasoning tasks, DSeq-JEPA yields consistent, statistically significant improvements over I-JEPA and related baselines, demonstrating more discriminative and transferable representations. The approach highlights the value of combining selective attention with region-wise autoregressive reasoning, offering a path toward more structured self-supervised learning that aligns with human visual perception.

Abstract

Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues -- a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.
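Step (i) of the abstract, ranking regions by saliency and keeping the Top-$N$, can be illustrated with a minimal sketch. It assumes the saliency map is already available as a grid of per-patch scores and treats each patch as one region; the function name `top_n_regions` and the toy values are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple


def top_n_regions(saliency: List[List[float]], n: int) -> List[Tuple[int, int]]:
    """Return (row, col) of the n highest-saliency patches, most salient first."""
    scored = [
        (score, r, c)
        for r, row in enumerate(saliency)
        for c, score in enumerate(row)
    ]
    # Sort by descending score; ties fall back to raster order for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in scored[:n]]


# Toy 3x3 patch-level saliency map (hypothetical values, not from the paper).
saliency = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.4],
    [0.2, 0.7, 0.1],
]
order = top_n_regions(saliency, n=3)
print(order)  # [(1, 1), (2, 1), (1, 2)]
```

The returned list doubles as the prediction order: the first entry plays the role of the primary cue, and later entries are the secondary cues to be predicted in turn.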

Paper Structure

This paper contains 39 sections, 5 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: (A) Humans perceive visual scenes selectively and sequentially, focusing on discriminative regions such as the red head and red chest of a Northern Cardinal. (B) DSeq-JEPA emulates this process by ranking regions by attention-derived importance and predicting each next discriminative region's embedding in a sequential, GPT-style manner.
  • Figure 2: (A) I-JEPA learns to predict the embeddings of the target regions ($y_1, \dots, y_N$) from a single context region $x$, using a predictor network conditioned on latent variables $z$. (B) DSeq-JEPA instead learns to predict the embeddings of the next discriminative regions $\{R_2, \dots, R_N\}$ sequentially, each from the sequence of previously identified regions, also using a predictor. The order of the discriminative regions is determined by the attention map of the image.
  • Figure 3: Overview of DSeq-JEPA. We compute a saliency map for each input image, identify the Top-$N$ high-response regions (with $N = 3$ for illustration), and feed them to the predictor sequentially. During prediction, each region's embedding is predicted from its preceding regions and positional tokens, and aligned with its target-encoder embedding. All encoders and predictors adopt the ViT architecture.
  • Figure 4: Qualitative visualization of learned attention and selected context/target regions using a ViT-B/16 model. For DSeq-JEPA, the numbered regions (1–5) correspond to discriminative regions ordered by their estimated importance.
  • Figure 5: Evolution of patch-level clustering during pre-training. From left to right: input image and 4-cluster results from ViT-B/16 checkpoints at 150, 300, 450, and 600 epochs.
  • ...and 2 more figures
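
The sequential prediction described in the captions for Figures 2 and 3, where each region's embedding is predicted from the regions before it, can be sketched as a simple loop. This is only an illustration under several assumptions: the predictor is replaced by a hypothetical stand-in that averages the preceding embeddings (the paper uses a ViT predictor), positional tokens are omitted, the same vectors serve as both context and target (the paper takes targets from a separate target encoder), and mean squared error is assumed as the alignment loss, which this section does not specify.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_predictor(prefix: Sequence[Vector]) -> Vector:
    """Hypothetical stand-in predictor: average of the preceding embeddings."""
    dim = len(prefix[0])
    return [sum(v[i] for v in prefix) / len(prefix) for i in range(dim)]


def sequential_loss(
    targets: Sequence[Vector],
    predictor: Callable[[Sequence[Vector]], Vector] = mean_predictor,
) -> float:
    """Average squared error of predicting region k from regions 1..k-1."""
    step_losses = []
    # Region 1 is given; regions 2..N are each predicted from their prefix.
    for k in range(1, len(targets)):
        pred = predictor(targets[:k])
        target = targets[k]
        sq_err = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
        step_losses.append(sq_err)
    return sum(step_losses) / len(step_losses)


# Toy embeddings for N = 3 regions, already in discriminative order
# (hypothetical values, not from the paper).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = sequential_loss(embeddings)
print(loss)  # 0.625
```

Because each step sees only the prefix `targets[:k]`, the loop has the same causal structure as GPT-style next-token prediction, with regions taking the place of tokens.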