SCALAR: Scale-wise Controllable Visual Autoregressive Learning
Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu
TL;DR
SCALAR addresses controllable image generation in Visual Autoregressive (VAR) models with Scale-wise Conditional Decoding: control encodings from a pretrained image encoder are projected into scale-specific representations and injected into the VAR backbone, so guidance persists across the generation hierarchy. Building on this, SCALAR-Uni aligns multiple control modalities in a shared latent space through Unified Control Alignment, enabling flexible multi-condition guidance within a single model. On ImageNet-256, SCALAR achieves better generation quality and control precision than diffusion-based and prior VAR-based methods, and zero-shot inpainting and outpainting results indicate strong generalization. The work points toward scalable, efficient, and versatile controllable generation in VAR frameworks, and the code is publicly released.
Abstract
Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. This design provides persistent and structurally aligned guidance throughout the generation process. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model. Extensive experiments show that SCALAR achieves superior generation quality and control precision across various tasks. The code is released at https://github.com/AMAP-ML/SCALAR.
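As a rough illustration of the scale-wise idea described above, the following Python sketch resamples a single control feature map to each scale of a coarse-to-fine hierarchy, applies a separate projection per scale, and combines the result with that scale's token embeddings. It is a minimal sketch under stated assumptions, not the released SCALAR implementation: the scale schedule `SCALES`, the nearest-neighbour resampling, the additive injection, and every function name here are hypothetical choices made for illustration.

```python
import random

# Assumed token-map side lengths, coarse to fine (illustrative only).
SCALES = (1, 2, 4)


def resize_nearest(feat, side):
    """Nearest-neighbour resample of an H x W x C nested list to side x side x C."""
    h, w = len(feat), len(feat[0])
    return [[feat[(i * h) // side][(j * w) // side] for j in range(side)]
            for i in range(side)]


def linear(vec, weight):
    """Apply a weight matrix (out_dim x in_dim) to one feature vector."""
    return [sum(w_i * x for w_i, x in zip(row, vec)) for row in weight]


def make_projection(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned, scale-specific projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def scale_wise_controls(control_feat, projections):
    """Return one projected control map per scale in SCALES."""
    controls = []
    for side, weight in zip(SCALES, projections):
        resized = resize_nearest(control_feat, side)
        controls.append([[linear(v, weight) for v in row] for row in resized])
    return controls


def inject(tokens, control):
    """Assumed injection rule: element-wise addition of same-shape maps."""
    return [[[t + c for t, c in zip(tv, cv)] for tv, cv in zip(trow, crow)]
            for trow, crow in zip(tokens, control)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for the output of a pretrained image encoder: 4 x 4 x 3.
    feat = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]
    projections = [make_projection(3, 5, rng) for _ in SCALES]
    for side, ctrl in zip(SCALES, scale_wise_controls(feat, projections)):
        tokens = [[[0.0] * 5 for _ in range(side)] for _ in range(side)]
        guided = inject(tokens, ctrl)
        print(side, len(guided), len(guided[0]), len(guided[0][0]))
```

Because every scale receives its own control map, the conditioning signal is present at each step of the coarse-to-fine generation rather than only at the start, which is the property the abstract calls persistent and structurally aligned guidance.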
