Revisiting Autoregressive Models for Generative Image Classification

Ilia Sudakov, Artem Babenko, Dmitry Baranchuk

Abstract

Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance, a notable achievement for generative classifiers.

Paper Structure (41 sections, 5 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: Effect of different token orders on AR-based image classification. Each row presents an image and five different token orders used by the any-order AR model to predict its class. The choice of token order can substantially affect the final classification outcome.
  • Figure 2: Order-marginalized generative classification framework. i) Input image is tokenized with the VQ-VAE into a sequence of discrete tokens. ii) Position-instruction tokens are concatenated with the image tokens, and $K$ randomly permuted sequences are constructed. iii) Class condition tokens $c_i$ for each class are appended to every sequence. iv) The RandAR model predicts $\log p(\mathbf{x}|c_i)$ using Equation \ref{eq:randar_likelihood}. v) The predicted class $c^*$ is obtained as $\mathop{\mathrm{arg\,max}}\limits_{c_i} \log p(\mathbf{x}|c_i)$.
  • Figure 3: Per-token "discriminative" log-likelihoods, computed as $\mathrm{clip}[\log p(\mathbf{x} | c_{\mathrm{true}}) - \log p(\mathbf{x} | c_{\mathrm{false}}), 0]$, across different orders and $K$ values. $c_{\mathrm{true}}$ denotes the correct class; $c_{\mathrm{false}}$ refers to a randomly selected incorrect one. Order-marginalized log-likelihood estimates ($K>1$) capture class-specific attributes more accurately.
  • Figure 4: Per-token accuracy of the RandAR classifier across different orders and $K$ values. Center image tokens appear more discriminative, likely due to the center-object bias present in ImageNet. Increasing $K$ consistently improves accuracy for all tokens.
  • Figure 5: Per-token accuracy of the RandAR classifier for $K=20$ w.r.t. the prefix length.
  • ...and 5 more figures
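
The pipeline in the Figure 2 caption (build $K$ random token orders, score every class under each order, take the arg max) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Two assumptions are made here: the order-marginalized score is taken to be the plain mean of the per-order log-likelihoods (the text only says predictions are averaged over token orders), and the RandAR model is replaced by a hypothetical toy scorer, `make_toy_log_likelihood`, a class-conditional bigram model over the permuted token sequence. All function and parameter names are illustrative.

```python
import numpy as np


def make_toy_log_likelihood(num_classes, vocab_size, seed=0):
    """Hypothetical stand-in for an any-order AR model (NOT RandAR).

    Each class owns a bigram table p(next | prev, class); the token
    sequence is read in the order given by the permutation.
    """
    rng = np.random.default_rng(seed)
    # Row index `vocab_size` acts as a start-of-sequence symbol.
    logits = rng.normal(size=(num_classes, vocab_size + 1, vocab_size))
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    def log_lik(tokens, order, class_id):
        seq = tokens[order]                              # tokens in this order
        prev = np.concatenate(([vocab_size], seq[:-1]))  # previous token per step
        return float(log_probs[class_id, prev, seq].sum())

    return log_lik


def classify_order_marginalized(tokens, num_classes, log_lik_fn, num_orders, seed=0):
    """Average per-order log-likelihoods over random orders, then arg max.

    Assumption: the order-marginalized estimate is the mean over
    `num_orders` (K) random permutations shared by all classes.
    """
    rng = np.random.default_rng(seed)
    orders = [rng.permutation(len(tokens)) for _ in range(num_orders)]
    scores = np.array([
        np.mean([log_lik_fn(tokens, order, c) for order in orders])
        for c in range(num_classes)
    ])
    return int(np.argmax(scores)), scores


# Usage: score one toy "image" of six discrete tokens against three classes.
toy_log_lik = make_toy_log_likelihood(num_classes=3, vocab_size=8)
toy_tokens = np.array([1, 5, 2, 7, 0, 3])
pred, scores = classify_order_marginalized(toy_tokens, 3, toy_log_lik, num_orders=5)
```

Sharing the same $K$ permutations across all classes keeps the class scores comparable, since each class is evaluated under identical orders. Setting `num_orders=1` recovers a single-order classifier, the setting the abstract identifies as limiting.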