Denoising Autoregressive Representation Learning
Yazhe Li, Jorg Bornschein, Ting Chen
TL;DR
Denoising Autoregressive Representation Learning (DARL) is a unified approach to learning visual representations while enabling generation: a decoder-only Vision Transformer predicts image patches autoregressively. It examines two training objectives, MSE and a diffusion-based objective with a denoising patch decoder, and shows that 2D rotary positional embeddings (2D RoPE) and appropriate noise schedules are crucial for performance, especially in larger models and with longer pre-training. DARL achieves results close to state-of-the-art masked-prediction methods on ImageNet and VTAB, with diffusion-based pre-training providing advantages under extended training and larger patch sizes. The work points to a viable path toward models capable of both strong perception and generation, albeit with a capacity trade-off between high-level abstraction and low-level detail, and with implications for scaling and responsible deployment.
Abstract
In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.
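As a rough illustration of the autoregressive MSE objective described above, the following NumPy sketch splits an image into a raster-ordered patch sequence and scores a causal predictor on next-patch prediction. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`patchify`, `next_patch_mse`, `causal_mean_predictor`) are hypothetical, the raster ordering is assumed, and a running mean stands in for the decoder-only Transformer.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into a raster-ordered sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # Axes after reshape: (grid_row, row_in_patch, grid_col, col_in_patch, channel).
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Bring the two grid axes to the front so each patch is contiguous.
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def next_patch_mse(patches, predict_fn):
    """Mean squared error for predicting patch t+1 from patches 0..t.

    `predict_fn` must be causal: its output at position t may depend only
    on inputs at positions <= t.
    """
    inputs, targets = patches[:-1], patches[1:]
    preds = predict_fn(inputs)
    return float(np.mean((preds - targets) ** 2))


def causal_mean_predictor(inputs):
    """Hypothetical stand-in for a causal Transformer: running mean of patches seen so far."""
    counts = np.arange(1, inputs.shape[0] + 1)[:, None]
    return np.cumsum(inputs, axis=0) / counts


# Usage: a random 8x8 RGB image with 4x4 patches gives a sequence of 4 patches.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))
patches = patchify(image, patch_size=4)
loss = next_patch_mse(patches, causal_mean_predictor)
```

In the diffusion variant mentioned in the abstract, the MSE regression target would be replaced by a denoising loss computed by a patch decoder conditioned on the Transformer output; that part is not sketched here.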
