FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction

Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie

TL;DR

FlexVAR rethinks visual autoregressive modeling by replacing residual prediction with ground-truth prediction at each scale, enabling flexible image generation across multiple resolutions, aspect ratios, and inference steps. It pairs a scalable VQVAE tokenizer with a GT-prediction transformer and 2D scalable positional embeddings to model multi-scale latent sequences, achieving state-of-the-art results on ImageNet-256×256 with competitive zero-shot performance at 512×512. The approach supports image-to-image tasks without fine-tuning and demonstrates strong generalization across resolutions, though high-resolution stability remains limited by dataset diversity. Overall, FlexVAR provides a powerful, flexible baseline for scalable, non-residual visual autoregression and broad downstream applicability.

Abstract

This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR performs autoregressive learning with ground-truth prediction, so that each step can independently produce a plausible image. This simple, intuitive approach learns visual distributions quickly and makes the generation process more flexible and adaptable. Trained solely on low-resolution images ($\leq$ 256px), FlexVAR can: (1) Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images. (2) Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion. (3) Adapt to various numbers of autoregressive steps, allowing faster inference with fewer steps or higher image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256$\times$256 benchmark. Moreover, when the image generation process is transferred zero-shot to 13 steps, performance further improves to 2.08 FID, outperforming the state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and the popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When our 1.0B model is transferred to the ImageNet 512$\times$512 benchmark in a zero-shot manner, FlexVAR achieves results competitive with the VAR 2.3B model, a fully supervised model trained at 512$\times$512 resolution.
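The difference between residual prediction and ground-truth prediction can be illustrated with a small NumPy sketch of how per-scale training targets might be built. This is a simplified illustration, not the authors' implementation: it skips vector quantization entirely, uses nearest-neighbor resizing as a stand-in for whatever interpolation the real tokenizers use, and the function names (`resize_nearest`, `gt_targets`, `residual_targets`) are hypothetical.

```python
import numpy as np

def resize_nearest(x, size):
    """Resize a square (H, W, C) map to (size, size, C) by nearest-neighbor sampling.
    Assumption: a stand-in for the interpolation used by a real tokenizer."""
    h, w = x.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[rows][:, cols]

def gt_targets(latent, scales):
    """GT-style targets: every scale is the full latent resized to that scale,
    so each target depends only on its own scale."""
    return [resize_nearest(latent, s) for s in scales]

def residual_targets(latent, scales):
    """Residual-style targets: each scale encodes what the coarser scales
    have not yet explained, so each target depends on the whole schedule."""
    full = latent.shape[0]
    remaining = latent.astype(float).copy()
    targets = []
    for s in scales:
        t = resize_nearest(remaining, s)
        targets.append(t)
        remaining = remaining - resize_nearest(t, full)
    return targets

rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 8, 4))  # toy continuous latent map

# Residual targets only recover the latent when all scales are summed.
res = residual_targets(latent, [2, 4, 8])
recon = sum(resize_nearest(t, 8) for t in res)
print(np.allclose(recon, latent))

# GT targets are unaffected by dropping or adding intermediate steps.
same_gt = np.array_equal(gt_targets(latent, [2, 8])[-1],
                         gt_targets(latent, [2, 4, 8])[-1])
same_res = np.allclose(residual_targets(latent, [2, 8])[-1],
                       residual_targets(latent, [2, 4, 8])[-1])
print(same_gt, same_res)
```

In this toy setting the final GT target is identical under both step schedules, while the final residual target changes when an intermediate step is removed. That coupling between a residual target and the rest of the schedule is one way to read the abstract's claim that ground-truth prediction makes the number of steps and the output resolution easier to vary.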

Paper Structure

This paper contains 19 sections, 6 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Generated samples from FlexVAR-$d24$ (1.0B). FlexVAR generates images with various resolutions and aspect ratios, even though it is trained with a resolution of $\leq$ 256$\times$256.
  • Figure 2: Comparison between VAR and our FlexVAR. VAR predicts the GT in step 1 and then predicts the residuals relative to the GT in all subsequent steps. Our FlexVAR predicts the GT at each step.
  • Figure 3: Comparison with the VQVAE tokenizers of VAR and LlamaGen on multi-scale image reconstruction. We downsample the latent features in the VQVAE to multiple scales and then use the VQVAE decoder to reconstruct images. Images smaller than 100 pixels are upsampled with bilinear interpolation for easier viewing.
  • Figure 4: Training loss of VAR vs. FlexVAR. FlexVAR converges faster. Results are reported after 40 epochs of training ($\sim$70K iterations).
  • Figure 5: Generated samples from 80px to 512px. FlexVAR demonstrates strong consistency across various scales and can generate 512px images, despite the model being trained only on images with resolutions $\leq$ 256. Zoom in for a better view.
  • ...and 12 more figures