Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?

Xinchen Yan, Chen Liang, Lijun Yu, Adams Wei Yu, Yifeng Lu, Quoc V. Le

TL;DR

This work investigates whether scaling autoregressive next-pixel prediction can yield practical vision models. Using IsoFlops-based scaling studies on $32\times 32$ images, it shows that optimal data and model sizes follow power laws with compute and that the scaling behavior is highly task- and resolution-dependent, with generation demanding more data than classification and higher resolutions favoring larger models. The findings quantify the compute bottleneck and provide exponents for how $N_{opt}$ and $D_{opt}$ scale with compute across metrics, suggesting pixel-level modeling could become feasible within about five years given sustained compute growth. The results offer a roadmap for designing pixel-based vision models and highlight the critical role of compute in determining the practicality of end-to-end raw-pixel transformers.
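The power-law relationship described above can be illustrated with a short fit in log-log space. This is a minimal sketch, not the authors' procedure: the helper names `fit_power_law` and `predict` are hypothetical, and the data points are synthetic values generated from an invented exponent of 0.5, not results from the paper.

```python
import numpy as np


def fit_power_law(compute, optimum):
    """Fit optimum ~= k * compute**a by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(optimum), deg=1)
    return float(np.exp(intercept)), float(slope)


def predict(k, a, compute):
    """Extrapolate the fitted power law to a new compute budget."""
    return k * compute ** a


# Synthetic IsoFlops optima (ASSUMPTION: invented constants, not the paper's).
budgets = np.array([1e17, 1e18, 1e19, 7e19])  # training FLOPs
n_opt = 0.05 * budgets ** 0.5                 # hypothetical optimal model sizes

k, a = fit_power_law(budgets, n_opt)
n_at_5x = predict(k, a, 5 * 7e19)  # extrapolate to 5x the largest budget
print(f"fitted exponent a = {a:.3f}, predicted N_opt at 3.5e20 FLOPs = {n_at_5x:.3e}")
```

The same fit would apply to $D_{opt}$ by passing optimal token counts instead of model sizes. Fitting a separate exponent per metric is what lets the scaling strategy differ between loss, classification, and generation.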

Abstract

This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at a resolution of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: the next-pixel prediction objective, ImageNet classification accuracy, and generation quality measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. Even at a fixed 32x32 resolution, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than the data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast that pixel-by-pixel modeling of images will become feasible within the next five years.
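To make the forecast concrete, the stated annual growth rate can be compounded over five years. This is only the arithmetic implied by the abstract's figures; the function name `compute_multiplier` is hypothetical, and the assumption that growth stays constant each year is ours.

```python
def compute_multiplier(annual_growth: float, years: int) -> float:
    """Total compute multiplier after compounding a fixed annual growth factor."""
    return annual_growth ** years


# Growth of 4x to 5x per year, sustained for five years (constant rate assumed).
low = compute_multiplier(4.0, 5)
high = compute_multiplier(5.0, 5)
print(f"five-year compute multiplier: {low:.0f}x to {high:.0f}x")
```

Under these assumptions, available compute would grow by roughly three orders of magnitude (1024x to 3125x) over the five-year horizon.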

Paper Structure

This paper contains 14 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Key findings on the scaling properties of next-pixel prediction, based on training Transformers on $32\times 32$ images. (a) Learning on raw pixels (blue line) requires a $10$-$20\times$ higher optimal token-to-parameter ratio than learning on language tokens (yellow line). (b) The optimal scaling strategy varies by metric: generation quality (Fréchet Distance, green) optimally requires more training data than classification (Top-1 accuracy, red) or the next-pixel prediction loss (blue). (c)-(d) The optimal-token and optimal-parameter setup is further verified by training models that follow the scaling prediction. Given 3.5e20 training FLOPs ($5\times$ more compute), we project 46.39% accuracy (vs. 46.41% in reality) and a Fréchet Distance of 244 (vs. 240 in reality).
  • Figure 2: Predicted scaling properties at an image resolution of $32\times 32$: next-pixel prediction loss (subfigures (a) and (d)), ImageNet classification accuracy (subfigures (b) and (e)), and image completion-based Fréchet Distance (subfigures (c) and (f)). We report the best-layer linear probing accuracy. We estimate Fréchet Distance between 2,048 reference images at $32\times 32$ and the corresponding 8,192 generated images.
  • Figure 3: Qualitative examples at $32\times 32$. The unmasked top half of the image is provided as initialization, and we autoregressively predict the bottom half one pixel at a time. Zoom in for a better view.
  • Figure 4: Optimal model / data scaling predictions vs. FLOPs across different image resolutions. We keep the ground-truth reference images at the native resolutions, and report the Fréchet Distance between 10,000 reference images and 10,000 generated images.
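
Several captions above report Fréchet Distance between reference and generated images. As a rough sketch of how such a distance is commonly computed, the code below fits a Gaussian to each set of feature vectors and evaluates the closed-form Fréchet distance between the two Gaussians. The function name `frechet_distance` is hypothetical, the random arrays stand in for image features, and the feature extractor is not specified in this summary, so this should not be read as the authors' exact evaluation pipeline.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (n_samples, dim) feature sets.

    Assumes dim >= 2 so that np.cov returns a square matrix.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory;
    # clip to guard against small negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Placeholder features (ASSUMPTION: random vectors, not real image embeddings).
rng = np.random.default_rng(0)
reference = rng.normal(size=(256, 8))
shifted = reference + 3.0  # same covariance, mean moved by 3 in each of 8 dims

d_same = frechet_distance(reference, reference)
d_shift = frechet_distance(reference, shifted)
print(f"identical sets: {d_same:.6f}, mean-shifted sets: {d_shift:.6f}")
```

Because the estimate depends on sample means and covariances, the number of reference and generated images (for example 2,048 vs. 8,192 in Figure 2, or 10,000 each in Figure 4) affects the value, so distances computed with different sample counts are not directly comparable.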