Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

Yu Zhang, Jingyi Liu, Yiwei Shi, Qi Zhang, Duoqian Miao, Changwei Wang, Longbing Cao

TL;DR

This work tackles the inefficiencies and error accumulation of full-context Visual AutoRegressive Modeling (VAR) by reframing next-scale prediction as a Markovian process. It introduces Markov-VAR, which treats each scale as a Markov state and augments it with a history compensation mechanism implemented via a sliding-window history vector to retain essential historical information. Empirical results on ImageNet show Markov-VAR achieves better generation quality (lower FID, higher IS) and drastically reduced peak memory usage compared to VAR and several VAR-like methods, with favorable scaling behavior. The approach offers a simple yet effective foundation for scalable visual autoregressive generation and related downstream tasks, with public release of model weights to support further research.

Abstract

Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: Compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 $\times$ 256) and decreases peak memory consumption by 83.8% (1024 $\times$ 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
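
To make the difference between full-context and Markovian scale prediction concrete, the following is a minimal sketch that only counts how many token positions each formulation attends to at a given scale. It is not the authors' implementation. The scale schedule `SCALES`, the choice to represent the history vector as a single extra position (`history_tokens=1`), the window size, and all function names are assumptions made for illustration. The counts are not a memory measurement and do not reproduce the reported 83.8% figure.

```python
# Illustrative sketch only; names and defaults are hypothetical, not from the paper.

# Assumed side lengths of the token maps at each scale (a 10-scale schedule).
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def tokens(side):
    """Number of tokens in a square token map with the given side length."""
    return side * side


def full_context_length(scales, k):
    """Positions attended to at scale k (0-indexed) under full-context
    dependency: the current scale plus every previous scale."""
    return sum(tokens(s) for s in scales[: k + 1])


def markov_context_length(scales, k, history_tokens=1):
    """Positions attended to at scale k under a Markovian formulation:
    the current scale plus a compact history vector (assumed to occupy
    `history_tokens` positions)."""
    return tokens(scales[k]) + history_tokens


def window_indices(k, window):
    """Indices of the previous scales covered by a sliding window of the
    given size when predicting scale k. These are the scales that would be
    compressed into the history vector."""
    return list(range(max(0, k - window), k))


if __name__ == "__main__":
    last = len(SCALES) - 1
    print(full_context_length(SCALES, last))    # 680
    print(markov_context_length(SCALES, last))  # 257
    print(window_indices(5, 3))                 # [2, 3, 4]
```

Under these assumptions, the full-context count grows with the sum over all scales, while the Markovian count depends only on the current scale and the fixed-size history vector.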

Paper Structure

This paper contains 29 sections, 5 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Visualization of images generated by our Markov-VAR at 256×256 or 512×512 on the ImageNet benchmark.
  • Figure 2: Observations of the challenges caused by full-context dependency. (a) Comparison of peak computation state (Activations + KV Cache) memory consumption between depth-24 VAR and Markov-VAR when generating 1024×1024 images with a batch size of 25. (b) Metrics and FID performance of VAR under perturbations injected at different scales. MSE, L1, and LPIPS all decrease as the perturbation is injected at later scales, indicating that perturbations injected early cause greater performance degradation. This is also evidenced by VAR's largest FID degradation occurring at the first injection scale. (c) Residual-Feature Alignment (RFA) scores between each scale and every previous scale. Each score is calculated as the cosine similarity between the output residual feature of the current scale and the input feature of each previous scale, combined with a $1\times1$ convolution projection and a square-root operation, and it preserves the directional contribution.
  • Figure 3: Left: Comparison of the modeling process of VAR and Markov-VAR when predicting the 6th scale, and comparison of the visual context of next-scale prediction and Markovian scale prediction during generation. Markov-VAR uses a history compensation mechanism to enrich the current scale with historical information. Right: The overall framework of Markovian scale prediction with the Markov-VAR Transformer. [S] is the start token with condition embedding.
  • Figure 4: Analysis of peak memory consumption of Markov-VAR across various depths and resolutions at different scales.
  • Figure 5: Scaling law analysis of Markov-VAR between performance metrics and model sizes with power-law fitted equations.
  • ...and 2 more figures