Denoising Autoregressive Representation Learning

Yazhe Li, Jorg Bornschein, Ting Chen

TL;DR

Denoising Autoregressive Representation Learning (DARL) presents a unified approach to learning visual representations while enabling generation by using a decoder-only Vision Transformer that predicts image patches autoregressively. It examines two training objectives—MSE and a diffusion-based objective with a denoising patch decoder—and shows that 2D rotary positional embeddings (2D RoPE) and appropriate noise schedules are crucial for performance, especially in larger models and longer pre-training. DARL achieves results close to state-of-the-art masked-prediction methods on ImageNet and VTAB, with diffusion-based pretraining providing advantages under extended training and larger patch sizes. The work highlights a viable path toward models capable of both strong perception and generation, albeit with a capacity trade-off between high-level abstraction and low-level detail and with implications for scaling and responsible deployment.

Abstract

In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.
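
The MSE variant described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the raster-scan patch order, the helper names (`patchify`, `causal_mask`, `next_patch_mse`), and the convention that the output at position i predicts patch i+1 are all assumptions made for illustration, and the Transformer itself is omitted.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches in raster order.

    Returns an array of shape (num_patches, patch * patch * C).
    Raster order is an assumption of this sketch.
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)

def causal_mask(n):
    """Boolean (n, n) mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def next_patch_mse(pred, patches):
    """MSE between pred[i] and the following patch, patches[i + 1].

    The last prediction has no target and is dropped (an assumption here).
    """
    return float(np.mean((pred[:-1] - patches[1:]) ** 2))

# Toy 32x32 RGB image split into four 16x16 patches.
image = np.arange(32 * 32 * 3, dtype=np.float64).reshape(32, 32, 3)
seq = patchify(image, 16)
mask = causal_mask(len(seq))
```

In a full model, `mask` would be passed to the attention layers so that each position sees only earlier patches, and `pred` would be the Transformer output at each position.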

Paper Structure

This paper contains 44 sections, 13 equations, 15 figures, and 16 tables.

Figures (15)

  • Figure 1: DARL architecture. Images are segmented into non-overlapping patches to form an input sequence. Causal attention masking is applied to the Vision Transformer. Random noise, parameterized by a noise schedule, is independently sampled to corrupt each patch. The output of the Transformer, along with the corrupted patch, is taken as input to the patch decoder to reconstruct the clean patch.
  • Figure 2: Noise schedule. $\gamma$ is sampled directly from a Beta distribution parameterized by $a$ and $b$. Left: Beta distributions with varying values of $a$ and $b$. Right: the corresponding transformation function when $\gamma$ is instead computed by transforming $s$ sampled from a uniform distribution.
  • Figure 3: ImageNet top-1 accuracy of models pre-trained with different noise schedules. The two panels show models trained with patch sizes 16 and 56, respectively. Models are pre-trained for 100 epochs and fine-tuned for 50 epochs. The colormap thresholds are placed at every 10th percentile (10th, 20th, ..., 90th). The x-axis and y-axis are the hyperparameters $a$ and $b$ of the Beta distribution from which $\gamma$ is sampled. The optimal noise schedule for ViT-L16 is biased toward extremely high noise levels, while ViT-L56 prefers a more balanced one.
  • Figure 4: ImageNet top-1 accuracy with varying training length. The model trained with the diffusion objective outperforms the MSE-trained model under longer training schedules. The diffusion noise schedule uses $a=0.03$ and $b=1$.
  • Figure 5: ImageNet top-1 accuracy of models pre-trained with varying patch sizes. The model trained with the diffusion objective degrades more gracefully than the one trained with MSE loss.
  • ...and 10 more figures
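
The per-patch corruption described in the captions of Figures 1 and 2 can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes the common variance-preserving form $x_\gamma = \sqrt{\gamma}\,x + \sqrt{1-\gamma}\,\epsilon$, where $\gamma$ is the signal level, so small $\gamma$ means heavy noise. The function names and the patch dimensions (196 patches of size 16x16x3) are hypothetical choices for the example. Only the Beta parameters $a=0.03$, $b=1$ come from the Figure 4 caption.

```python
import numpy as np

def sample_gamma(rng, a, b, size):
    """Draw one signal level gamma in [0, 1] per patch from Beta(a, b)."""
    return rng.beta(a, b, size=size)

def corrupt_patches(patches, gamma, rng):
    """Corrupt each patch independently with Gaussian noise.

    Assumed form: sqrt(gamma) * x + sqrt(1 - gamma) * eps.
    patches: (num_patches, dim); gamma: (num_patches,).
    Returns the noisy patches and the noise that was added.
    """
    eps = rng.standard_normal(patches.shape)
    g = gamma[:, None]  # broadcast one gamma across each patch vector
    noisy = np.sqrt(g) * patches + np.sqrt(1.0 - g) * eps
    return noisy, eps

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))  # stand-in for real image patches
gamma = sample_gamma(rng, a=0.03, b=1.0, size=len(patches))
noisy, eps = corrupt_patches(patches, gamma, rng)
```

Under this convention, Beta(0.03, 1) places most of its mass near $\gamma = 0$, which matches the Figure 3 caption's remark that the best schedule for ViT-L16 is biased toward very high noise levels.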