SCALAR: Scale-wise Controllable Visual Autoregressive Learning

Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu

TL;DR

SCALAR tackles the challenge of controllable image generation in Visual Autoregressive (VAR) models by introducing Scale-wise Conditional Decoding, which injects scale-specific control encodings derived from a pretrained image encoder into the VAR backbone, ensuring persistent guidance across the generation hierarchy. Building on this, SCALAR-Uni unifies multiple control modalities in a shared latent space through Unified Control Alignment, enabling flexible multi-condition guidance within a single model. Empirical results on ImageNet-256 show SCALAR achieves superior generation quality and control precision compared to diffusion and prior VAR methods, with zero-shot inpainting/outpainting demonstrating strong generalization. The work suggests a promising direction for scalable, efficient, and versatile controllable generation in VAR frameworks, and provides a practical code release for broader adoption.

Abstract

Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. This design provides persistent and structurally aligned guidance throughout the generation process. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model. Extensive experiments show that SCALAR achieves superior generation quality and control precision across various tasks. The code is released at https://github.com/AMAP-ML/SCALAR.
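The scale-wise injection described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the released implementation: the nearest-neighbour resize, the one-linear-map-per-scale projection, the additive injection, and every name and size below (`resize_nearest`, `scale_wise_control`, `inject`, the scale list `[1, 2, 4, 8]`, the channel widths) are assumptions chosen only to show the data flow from one encoder feature map to per-scale control tokens.

```python
import random


def resize_nearest(feat, size):
    """Nearest-neighbour resize of an H x W x C nested list to size x size x C."""
    h, w = len(feat), len(feat[0])
    return [[feat[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]


def linear(vec, weight):
    """Apply a C_in -> C_out linear map stored as a list of C_out rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def scale_wise_control(feat, scales, weights):
    """Return one flattened list of projected control tokens per scale.

    Assumption: each scale has its own projection weight; the paper also
    studies other sharing schemes.
    """
    tokens_per_scale = []
    for size, weight in zip(scales, weights):
        resized = resize_nearest(feat, size)
        tokens = [linear(vec, weight) for row in resized for vec in row]
        tokens_per_scale.append(tokens)
    return tokens_per_scale


def inject(hidden, control):
    """Add control tokens to same-scale hidden states (assumed additive)."""
    return [[h + c for h, c in zip(hv, cv)] for hv, cv in zip(hidden, control)]


# Hypothetical sizes: an 8 x 8 encoder feature map with 4 channels,
# projected to 3-channel tokens at four scales.
random.seed(0)
C_IN, C_OUT = 4, 3
feat = [[[random.random() for _ in range(C_IN)] for _ in range(8)]
        for _ in range(8)]
scales = [1, 2, 4, 8]
weights = [[[random.gauss(0, 0.1) for _ in range(C_IN)] for _ in range(C_OUT)]
           for _ in scales]

control = scale_wise_control(feat, scales, weights)
token_counts = [len(tokens) for tokens in control]
print(token_counts)  # [1, 4, 16, 64]

# Inject into placeholder (all-zero) hidden states of the coarsest scale.
hidden0 = [[0.0] * C_OUT for _ in control[0]]
injected0 = inject(hidden0, control[0])
```

Each scale receives its own control tokens at its own resolution, which is the sense in which the guidance stays structurally aligned with the token map being predicted at that scale.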

Paper Structure

This paper contains 14 sections, 7 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: SCALAR, a novel controllable VAR method, achieves superior generation quality and control capabilities for various types of controllable signals (top row). It also exhibits robust zero-shot generalizability to tasks such as inpainting and outpainting (middle row). SCALAR-Uni further extends it by supporting multi-condition control within a unified model (bottom row).
  • Figure 2: The SCALAR framework builds on the next-scale prediction paradigm of VAR with a Scale-wise Conditional Decoding mechanism (see the Scale-wise Conditional Decoding section for details). The feature $\mathbf{F}_{c}$ is obtained by concatenating four features $\mathbf{F}_{d}$ extracted by the Image Encoder $\mathcal{E}$.
  • Figure 3: (a) Comparison of parameter-sharing schemes for projection blocks ($\mathcal{P}_{k,l}$, $\mathcal{P}_{k}$, and $\mathcal{P}_{l}$). (b) Comparison of injection layer sets ($\mathcal{S}_1$, $\mathcal{S}_{\text{alt}}$, and $\mathcal{S}_{\text{all}}$) with different projection block structures (Linear and LinearLite). (c) Comparison of parameter-efficient training strategies ($\text{Frz}_\text{none}$, $\text{Frz}_\text{SA}$, and $\text{Frz}_\text{all}$). (d) Impact of scaling up the depth of the VAR backbone (VAR-d12, d16, and d20). Note: All experiments are conducted on ImageNet with c2i controllable generation.
  • Figure 4: Our SCALAR-Uni, a unified multi-condition control method. Building on SCALAR, we introduce unified control alignment to map diverse control features into a common, modality-agnostic latent space. During training, control images are randomly sampled with equal probability.
  • Figure 5: Visual results generated by SCALAR for class-to-image controllable generation.
  • ...and 5 more figures
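
The three projection-block sharing schemes compared in Figure 3(a) can be made concrete with a small bookkeeping sketch. It assumes that $k$ indexes scales and $l$ indexes injection layers, which the caption does not state explicitly; the function names, the mode strings, and the counts of 10 scales and 16 layers are hypothetical and only illustrate how many distinct blocks each scheme would need.

```python
def build_projection_table(num_scales, num_layers, sharing):
    """Map every (scale, layer) pair to the id of the projection block it uses."""
    table = {}
    for k in range(num_scales):
        for l in range(num_layers):
            if sharing == "per_scale_and_layer":   # P_{k,l}: no sharing
                block_id = (k, l)
            elif sharing == "per_scale":           # P_k: shared across layers
                block_id = (k, None)
            elif sharing == "per_layer":           # P_l: shared across scales
                block_id = (None, l)
            else:
                raise ValueError(f"unknown sharing mode: {sharing}")
            table[(k, l)] = block_id
    return table


def count_blocks(table):
    """Number of distinct projection blocks referenced by the table."""
    return len(set(table.values()))


# Illustrative sizes only (assumed, not taken from the paper).
NUM_SCALES, NUM_LAYERS = 10, 16
block_counts = {
    mode: count_blocks(build_projection_table(NUM_SCALES, NUM_LAYERS, mode))
    for mode in ("per_scale_and_layer", "per_scale", "per_layer")
}
print(block_counts)
# {'per_scale_and_layer': 160, 'per_scale': 10, 'per_layer': 16}
```

Under these assumptions, sharing along either axis reduces the number of projection blocks from the product of scales and layers to a single factor, which is the trade-off the ablation examines.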