Next-Embedding Prediction Makes Strong Vision Learners

Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu

TL;DR

NEPA (Next-Embedding Predictive Autoregression) is a minimalist, single-stream pretraining objective that trains a Vision Transformer to predict future patch embeddings from past context, without decoders, tokenizers, or contrastive heads. The approach achieves strong ImageNet-1K fine-tuning results and competitive ADE20K segmentation, using causal embedding-level prediction as the objective and RoPE, LayerScale, QK-Norm, and SwiGLU for stability and performance. Ablations show that the autoregressive shift, causal masking, and stop-gradient are each necessary, that the stabilization components bring complementary benefits, and that results improve with larger models. The work argues for a simple, scalable, potentially modality-agnostic route to visual self-supervised learning, with avenues toward generative and cross-modal applications.

Abstract

Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop-gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1K with next-embedding prediction as its sole learning objective is effective: no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
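
The shifted prediction objective described in the abstract can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names, the `shift` parameter, and the choice of negative cosine similarity as the distance are assumptions made for illustration, since this summary does not state which distance the paper uses. Plain Python has no gradients, so the stop-gradient on the targets appears only as a comment.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def next_embedding_loss(inputs, outputs, shift=1):
    """Negative mean cosine similarity between outputs[t] and inputs[t + shift].

    inputs:  patch embeddings fed to the model; they also serve as targets.
    outputs: model outputs, where outputs[t] is assumed to depend only on
             inputs[0..t] (causal masking happens inside the model).
    In an autograd framework the targets would be wrapped in a stop-gradient
    so that no gradient flows through them; that step is only noted here.
    """
    T = len(inputs)
    if len(outputs) != T:
        raise ValueError("inputs and outputs must have the same length")
    if not 0 <= shift < T:
        raise ValueError("shift must satisfy 0 <= shift < sequence length")

    preds = outputs[: T - shift]   # predictions made at positions 0..T-shift-1
    targets = inputs[shift:]       # the embeddings those positions should predict

    sims = []
    for p, z in zip(preds, targets):
        p, z = l2_normalize(p), l2_normalize(z)
        sims.append(sum(a * b for a, b in zip(p, z)))
    return -sum(sims) / len(sims)


# Toy sequence of three 2-D "patch embeddings" (hypothetical values).
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A predictor whose output at position t points in the direction of z[t + 1].
perfect = [[0.0, 2.0], [3.0, 3.0], [9.0, 9.0]]

# A predictor that simply copies its input.
copy = z

loss_perfect = next_embedding_loss(z, perfect)           # close to -1
loss_copy_shifted = next_embedding_loss(z, copy)         # clearly above -1
loss_copy_unshifted = next_embedding_loss(z, copy, 0)    # close to -1
```

In this toy setting, a model that copies its input reaches the minimum of the unshifted loss without predicting anything, while the shifted loss rewards only outputs that match the next embedding. This is one possible intuition for why the shift matters, offered here as an illustration rather than as the paper's own explanation.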

Paper Structure

This paper contains 55 sections, 3 equations, 9 figures, 13 tables, 1 algorithm.

Figures (9)

  • Figure 1: Next-Embedding Predictive Autoregression (NEPA). An image is split into patches and embedded into a sequence. An autoregressive model predicts the next embedding from previous ones, mirroring next-token prediction in language models.
  • Figure 2: Images are tokenized via a Conv2d patch embedder before entering a pre-norm Transformer with LayerNorm. Modern stabilization components (RoPE [su2024roformer], LayerScale [touvron2021cait], SwiGLU [shazeer2020glu], and QK-Norm [henry2020qknorm]) are applied at all layers.
  • Figure 3: Ablation of key components in NEPA pretraining. Left: EMA accuracy with and without AR shift. Without the autoregressive shift, training diverges early. Middle-left: Training loss with and without stop-grad; removing stop-grad causes representation collapse. Middle-right: Training loss with and without LayerScale; LayerScale stabilizes optimization and accelerates convergence. Right: Gradient norm with and without QK-Norm; QK-Norm suppresses gradient explosion and improves smoothness.
  • Figure 4: ImageNet-1K validation Top-1 accuracy versus training epochs. For each epoch’s checkpoint, we perform a lightweight hyperparameter search and report the best accuracy. Fine-tuning uses causal attention. The top plot corresponds to the base model, and the bottom plot to the large model.
  • Figure 5: Attention and embedding analyses. Each example consists of three views: (i) the query patch ($\square$) highlighted in the original image, (ii) the attention map from NEPA showing which patches the model attends to when predicting the next embedding, and (iii) the embedding-similarity map showing the cosine similarity between the predicted embedding and all other patch embeddings in the same image. Warmer colors indicate higher attention or greater similarity; cooler colors indicate lower values.
  • ...and 4 more figures
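
The embedding-similarity map described for Figure 5 can be sketched as follows. This is a minimal illustration, not the authors' visualization code: the function names, the `grid_width` parameter, and the toy values are hypothetical, and the sketch assumes patch embeddings are stored in row-major order over the image grid.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def similarity_map(predicted, patch_embeddings, grid_width):
    """Cosine similarity of one predicted embedding to every patch embedding,
    reshaped into rows of length grid_width (row-major patch order assumed)."""
    if len(patch_embeddings) % grid_width != 0:
        raise ValueError("number of patches must be a multiple of grid_width")
    sims = [cosine(predicted, z) for z in patch_embeddings]
    return [sims[i:i + grid_width] for i in range(0, len(sims), grid_width)]


# Toy 2x2 grid of 2-D patch embeddings and one predicted embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
predicted = [2.0, 0.0]

sim_grid = similarity_map(predicted, patches, grid_width=2)
```

Each entry of `sim_grid` lies in [-1, 1]; a plotting library could render it as a heatmap in which warmer colors mark patches whose embeddings are closest in direction to the prediction.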