Multi-scale Autoregressive Models are Laplacian, Discrete, and Latent Diffusion Models in Disguise

Steve Hong, Samuel Belkadi

TL;DR

This work reframes Visual Autoregressive (VAR) models as iterative refiners operating on a latent Laplacian pyramid, connecting VAR to denoising diffusion while preserving cross-scale factorisation and within-scale parallelism. It identifies three design levers (refining in a learned latent space, predicting discrete code indices, and explicit frequency-band refinement) that help explain VAR’s fidelity and speed. Through MNIST-based ablations, the paper shows that a small number of coarse-to-fine residual steps captures most of the gains, and it discusses extending the framework to permutation-invariant graph generation and probabilistic weather forecasting. The proposed view also lets VAR adopt diffusion-inspired tools (guidance, consistency, few-step distillation), offering practical routes to faster, scalable generation in domains beyond images.

Abstract

We revisit Visual Autoregressive (VAR) models through the lens of an iterative-refinement framework. Rather than viewing VAR solely as next-scale autoregression, we formalise it as a deterministic forward process that constructs a Laplacian-style latent pyramid, paired with a learned backward process that reconstructs it in a small number of coarse-to-fine steps. This view connects VAR to denoising diffusion and isolates three design choices that help explain its efficiency and fidelity: refining in a learned latent space, casting prediction as discrete classification over code indices, and partitioning the task by spatial frequency. We run controlled experiments to quantify each factor's contribution to fidelity and speed, and we outline how the same framework extends to permutation-invariant graph generation and to probabilistic, ensemble-style medium-range weather forecasting. The framework also suggests practical interfaces for VAR to leverage tools from the diffusion ecosystem while retaining few-step, scale-parallel generation.

Paper Structure

This paper contains 28 sections, 20 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Iterative refinement in two paradigms. VAR refines by predicting next-scale discrete codes across a small number of scales, while diffusion refines by denoising continuous latents over multiple timesteps.
  • Figure 2: Forward process of VAR. The image is encoded into a latent grid, which is progressively downsampled to build a pyramid. At each level, the coarser level is upsampled and subtracted from the current level to give a residual, which is quantised into discrete indices; the coarsest level is quantised directly as base codes. This forward path is deterministic and provides the training targets for the generator.
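
The forward process described in Figure 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`downsample`, `upsample`, `quantise`, `forward_pyramid`, `reconstruct`), the 2x2 average pooling, the nearest-neighbour upsampling, the single shared codebook, and the choice to compute residuals against the unquantised coarser level are all hypothetical choices made for brevity. The latent grid is taken as given, standing in for the output of a learned encoder.

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution by 2x2 average pooling; x has shape (H, W, C)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    """Double spatial resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def quantise(x, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    flat = x.reshape(-1, x.shape[-1])                        # (H*W, C)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(x.shape[:2])         # (H, W) integer codes

def forward_pyramid(latent, codebook, num_levels):
    """Deterministic forward path: latent grid -> base codes + residual codes."""
    # Build the pyramid by repeated downsampling, then order it coarsest first.
    levels = [latent]
    for _ in range(num_levels - 1):
        levels.append(downsample(levels[-1]))
    levels = levels[::-1]

    # The coarsest level is quantised directly as base codes.
    base_codes = quantise(levels[0], codebook)

    # Each finer level is encoded as a quantised residual against the
    # upsampled coarser level (assumption: the unquantised coarser level).
    residual_codes = []
    for coarse, fine in zip(levels[:-1], levels[1:]):
        residual = fine - upsample(coarse)
        residual_codes.append(quantise(residual, codebook))
    return base_codes, residual_codes

def reconstruct(base_codes, residual_codes, codebook):
    """Coarse-to-fine decoding from codes: upsample, then add the next residual."""
    x = codebook[base_codes]
    for codes in residual_codes:
        x = upsample(x) + codebook[codes]
    return x

# Toy usage with a random "latent" and a random codebook (both placeholders).
rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 8, 4))      # H = W = 8, C = 4 channels
codebook = rng.normal(size=(16, 4))      # 16 code vectors of dimension 4
base_codes, residual_codes = forward_pyramid(latent, codebook, num_levels=3)
approx = reconstruct(base_codes, residual_codes, codebook)
```

With three levels, the 8x8 latent yields a 2x2 grid of base codes and residual code grids of size 4x4 and 8x8. In this framing the generator would be trained to predict those index grids one scale at a time, and `reconstruct` mirrors the small number of coarse-to-fine steps taken at sampling time; with a random codebook the reconstruction is of course only a rough approximation of the input.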