LSRS: Latent Scale Rejection Sampling for Visual Autoregressive Modeling

Hong-Kai Zheng, Piji Li

TL;DR

VAR accelerates image generation by sampling tokens within each scale in parallel, but independent in-scale sampling can cause structural errors. LSRS addresses this by introducing a lightweight scoring model to evaluate multiple candidate latent-scale token maps and perform rejection sampling during inference, prioritizing early scales that govern structural coherence. Training uses a static dataset of real and generated token maps, with pairwise or pointwise losses guiding the scorer to distinguish real versus generated structures. Empirical results show consistent FID improvements across VAR depths with minimal computational overhead, making LSRS a practical, efficient test-time scaling strategy for VAR-based image generation.

Abstract

Visual Autoregressive (VAR) modeling generates images autoregressively across hierarchical scales, decoding multiple tokens per scale in parallel. This approach achieves high-quality generation while accelerating synthesis. However, parallel token sampling within a scale may lead to structural errors, resulting in suboptimal generated images. To mitigate this, we propose Latent Scale Rejection Sampling (LSRS), a method that progressively refines token maps in the latent scale during inference to enhance VAR models. Our method uses a lightweight scoring model to evaluate multiple candidate token maps sampled at each scale and selects the highest-scoring map to guide generation at subsequent scales. By prioritizing the early scales, which are critical for structural coherence, LSRS mitigates autoregressive error accumulation while maintaining computational efficiency. Experiments demonstrate that LSRS significantly improves VAR's generation quality with minimal additional computational overhead. For the VAR-d30 model, LSRS increases inference time by only 1% while reducing the FID score from 1.95 to 1.78. When inference time is increased by 15%, the FID score drops further to 1.66. LSRS thus offers an efficient test-time scaling solution for enhancing VAR-based generation.

Paper Structure

This paper contains 23 sections, 7 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: The leftmost image is generated using VAR-$d30$ with the class label "fountain". The images labeled from scale 1 to 10 are obtained by replacing the token maps of VAR at each individual scale with random token maps and then decoding the final images.
  • Figure 2: An illustration of LSRS applied during VAR inference. At each scale, multiple candidate token maps are sampled from VAR's output distribution. The LSRS scoring model then evaluates each token map, and the one with the highest score is selected as the final output for that scale.
  • Figure 3: Ablation experiment on hyperparameters $M$ and $ST$. Left: Metrics across $M$ values with $ST=2$. Right: Metrics across $ST$ values with $M=32$. FID BL and IS BL denote the baseline metrics, i.e., those of the original VAR model. Detailed data can be found in the Appendix.
  • Figure 4: LSRS Generation Results Demonstration. The leftmost image is the original VAR-$d30$ generation, while the others show results after LSRS intervention. From top to bottom, they are: mountain tent, balloon, black stork, park bench, lakeside, monarch butterfly and castle.
  • Figure 5: LSRS scoring model analysis. Top-left: validation loss. Top-right: validation accuracy. Bottom-left: VAR-$d24$ score distribution. Bottom-right: VAR-$d30$ score distribution. Overall, the accuracy of the scoring model tends to be higher at larger scales and with smaller models.
  • ...and 4 more figures
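
The per-scale selection step described in the Figure 2 caption (sample several candidate token maps, score each, keep the highest-scoring one) can be illustrated with a minimal sketch. This is not the authors' implementation: the flattened token-map representation, the names `sample_token_map` and `select_best_candidate`, the `num_candidates` parameter, and the toy scorer are all hypothetical stand-ins. In particular, the real scoring model is a learned network, whereas the toy scorer here only counts tokens so that the example runs on its own.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a token map flattened to a list of token ids.
TokenMap = List[int]


def sample_token_map(probs: Sequence[Sequence[float]], rng: random.Random) -> TokenMap:
    """Sample every position independently from its categorical distribution,
    mirroring parallel in-scale sampling."""
    vocab = range(len(probs[0]))
    return [rng.choices(vocab, weights=p, k=1)[0] for p in probs]


def select_best_candidate(
    probs: Sequence[Sequence[float]],
    score_fn: Callable[[TokenMap], float],
    num_candidates: int,
    rng: random.Random,
) -> Tuple[TokenMap, float]:
    """Draw `num_candidates` token maps for one scale and return the one
    that the scorer rates highest, together with its score."""
    candidates = [sample_token_map(probs, rng) for _ in range(num_candidates)]
    scores = [score_fn(c) for c in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_score(token_map: TokenMap) -> float:
    """Stand-in scorer (assumption): rewards maps containing many 0 tokens."""
    return float(sum(1 for t in token_map if t == 0))


if __name__ == "__main__":
    rng = random.Random(0)
    # Four positions, vocabulary of three tokens, uniform probabilities.
    uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in range(4)]
    best_map, best_score = select_best_candidate(uniform, toy_score, 8, rng)
    print(best_map, best_score)
```

In a full pipeline the selected map would be fed back as the conditioning input for the next scale. The sketch covers a single scale only and says nothing about which scales the selection is applied to.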