Rethinking Training Dynamics in Scale-wise Autoregressive Generation

Gengze Zhou, Chongjian Ge, Hao Tan, Feng Liu, Yicong Hong

TL;DR

The paper tackles exposure bias and scale-wise learning imbalance in scale-wise autoregressive visual generation. It introduces Self-Autoregressive Refinement (SAR) with Stagger-Scale Rollout (SSR) and Contrastive Student-Forcing Loss (CSFL) to align training with inference and stabilize multi-scale predictions. Empirical results on ImageNet-256 show consistent FID improvements across multiple VAR scales with modest compute, and SAR achieves superior throughput–FID trade-offs compared to baselines. The method functions as a lightweight post-training refinement, offering a practical path to stronger and more reliable visual autoregressive generation.

Abstract

Recent advances in autoregressive (AR) generative models have produced increasingly powerful systems for media synthesis. Among them, next-scale prediction has emerged as a popular paradigm, where models generate images in a coarse-to-fine manner. However, scale-wise AR models suffer from exposure bias, which undermines generation quality. We identify two primary causes of this issue: (1) train-test mismatch, where the model must rely on its own imperfect predictions during inference, and (2) imbalance in scale-wise learning difficulty, where certain scales exhibit disproportionately higher optimization complexity. Through a comprehensive analysis of training dynamics, we propose Self-Autoregressive Refinement (SAR) to address these limitations. SAR introduces a Stagger-Scale Rollout (SSR) mechanism that performs lightweight autoregressive rollouts to expose the model to its own intermediate predictions, thereby aligning train-test patterns, and a complementary Contrastive Student-Forcing Loss (CSFL) that provides adequate supervision for self-generated contexts to ensure stable training. Experimental results show that applying SAR to pretrained AR models consistently improves generation quality with minimal computational overhead. For instance, SAR yields a 5.2% FID reduction on FlexVAR-d16 trained on ImageNet 256 within 10 epochs (5 hours on 32xA100 GPUs). Given its efficiency, scalability, and effectiveness, we expect SAR to serve as a reliable post-training method for visual autoregressive generation.
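
The two-pass idea described in the abstract (a rollout that exposes the model to its own predictions, plus a loss that supervises those self-generated contexts) can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: latents are 1-D Python lists, `upsample`, `mse`, and `sar_losses` are hypothetical helpers, and a plain mean squared error toward the teacher-forced output stands in for CSFL, whose exact form is not given in this summary.

```python
from typing import Callable, List, Tuple

Latent = List[float]  # toy 1-D stand-in for a latent map at one scale


def upsample(latent: Latent, size: int) -> Latent:
    """Nearest-neighbour upsampling of a toy latent to `size` entries."""
    n = len(latent)
    return [latent[i * n // size] for i in range(size)]


def mse(a: Latent, b: Latent) -> float:
    """Mean squared error between two equally sized latents."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sar_losses(model: Callable[[Latent], Latent], gt: List[Latent]) -> Tuple[float, float]:
    """Return (teacher-forcing loss, student-forcing loss) for one sample.

    `gt` holds ground-truth latents ordered coarse to fine (at least 3 scales).
    """
    k = len(gt)
    # Pass 1 (teacher forcing): predict scale i from ground-truth scale i-1.
    tf_pred = [model(upsample(gt[i - 1], len(gt[i]))) for i in range(1, k)]
    tf_loss = sum(mse(p, gt[i + 1]) for i, p in enumerate(tf_pred)) / len(tf_pred)

    # Pass 2 (student forcing): each pass-1 prediction is upsampled and fed in
    # as the input for the next scale, so the model conditions on its own output.
    sf_pred = [model(upsample(tf_pred[j], len(gt[j + 2]))) for j in range(k - 2)]
    # Assumption: plain MSE toward the teacher-forced output of the same scale
    # stands in for CSFL; a real implementation might also stop gradients
    # through that teacher-forced target.
    sf_loss = sum(mse(s, tf_pred[j + 1]) for j, s in enumerate(sf_pred)) / len(sf_pred)
    return tf_loss, sf_loss


gt_scales = [[1.0], [1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
toy_model = lambda x: [0.5 * v for v in x]  # deliberately biased toy "model"
print(sar_losses(toy_model, gt_scales))  # (0.25, 0.0625)
```

The second pass reuses the predictions of the first pass, shifted by one scale, so the cost is one additional forward pass rather than a full sequential rollout over all scales.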

Paper Structure

This paper contains 23 sections, 14 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: (a) Training dynamics. Training curves for FlexVAR, SAR trained from scratch, and SAR initialized from a pretrained FlexVAR checkpoint. Within only a few epochs, SAR surpasses the best performance of a fully trained FlexVAR model. (b) Throughput–FID trade-off. Comparison of throughput, parameter count, and FID across representative generative model families, including Diffusion, Next-token AR, and Next-scale AR. SAR (red) attains the best overall trade-off: it has the highest throughput among autoregressive models and further improves the next-scale prediction AR model, reaching the lowest FID across all AR baselines.
  • Figure 2: Illustration of training supervision imbalance. For latent-space (top) supervision, coarse scales receive ground-truth signals that contain little semantic structure, while their corresponding training inputs are dominated by blurry upsampled artifacts. Consequently, the finest scale must reconstruct nearly all details, causing the hierarchical prediction process to collapse into a single dominant scale and preventing effective coarse-to-fine learning. We could smooth the generation trajectory by downsampling in image space (bottom), but this causes the earliest scales to capture most semantics and scene structure already, leaving later scales to perform only mild sharpening or super-resolution, and thereby weakening the multiscale factorization of the model.
  • Figure 3: Training–inference divergence caused by scale-wise supervision imbalance (see the section on scale-wise imbalance). Under teacher forcing (top), the model receives ground-truth latents at all scales; once training has converged, the model produces clean results because it is evaluated under the same idealized inputs used during training. At inference (bottom), the model must condition on its own coarse-scale predictions. When the generated early-scale latents are imperfect (e.g., at $1\times 1$), the later scales, which were trained mainly as a super-resolution task under the smoothed up/down-sampled image supervision, cannot correct the semantic error, leading to a complete collapse of the generation process.
  • Figure 4: Illustration of different teacher- and student-forcing training schemes: (a) Teacher Forcing (TF) uses ground-truth latents at all scales during training. (b) Student Forcing (SF) uses predicted latents only, simulating test-time conditions. (c) Hybrid TF & SF applies teacher forcing at early scales and student forcing at later ones. (d) Interleave TF & SF alternates between teacher and student forcing across scales.
  • Figure 5: Illustration of SAR. The image is encoded into multi-scale latents $\{f_i\}$, which condition an autoregressive generator. In the first forward pass, the model performs teacher forcing and predicts $\hat{f}^{(T)}_i$ at all scales. These predictions are then upsampled to form scale-shifted inputs $\tilde{f}^{(T)}_i$, enabling a second forward pass that produces student-forced predictions $\hat{f}^{(S)}_i$. Teacher-forcing loss provides ground-truth supervision, while the contrastive student-forcing loss aligns student-forced outputs with their teacher-forced counterparts. Together, these two passes form the Stagger-Scale Rollout used in SAR.
  • ...and 3 more figures
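
The four training schemes listed in the Figure 4 caption differ only in which scales take ground-truth inputs and which take the model's own predictions. A minimal sketch of that choice as a per-scale boolean mask follows; the function name `forcing_schedule`, the `switch_at` parameter, and the decision to start the interleaved pattern with teacher forcing are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def forcing_schedule(scheme: str, num_scales: int, switch_at: int = 0) -> List[bool]:
    """Per-scale input source: True = ground truth (teacher), False = own prediction.

    `switch_at` is only used by the hybrid scheme and marks the first
    student-forced scale (a hypothetical parameter for this sketch).
    """
    if scheme == "tf":          # (a) teacher forcing at every scale
        return [True] * num_scales
    if scheme == "sf":          # (b) student forcing at every scale
        return [False] * num_scales
    if scheme == "hybrid":      # (c) teacher forcing early, student forcing later
        return [i < switch_at for i in range(num_scales)]
    if scheme == "interleave":  # (d) alternate, starting with teacher forcing (assumed)
        return [i % 2 == 0 for i in range(num_scales)]
    raise ValueError(f"unknown scheme: {scheme!r}")


for name in ("tf", "sf", "hybrid", "interleave"):
    print(name, forcing_schedule(name, num_scales=4, switch_at=2))
```

With four scales and `switch_at=2`, the hybrid mask is `[True, True, False, False]` and the interleaved mask is `[True, False, True, False]`.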