A Training-Free Style-aligned Image Generation with Scale-wise Autoregressive Model
Jihun Park, Jongmin Gim, Kyoungmin Lee, Minseok Oh, Minwoo Choi, Jaeyeul Kim, Woo Chool Park, Sunghoon Im
TL;DR
The paper tackles style misalignment and slow inference in large-scale text-to-image (T2I) generation by introducing a training-free framework built on a scale-wise autoregressive model. It analyzes next-scale prediction to identify the stages at which RGB statistics, object placement, and style are established, and then applies three interventions (initial feature replacement, pivotal feature interpolation, and dynamic style injection) to align style across a set of images while preserving each image's content. Experiments show generation quality comparable to competing approaches, stronger style consistency, and inference more than six times faster than the fastest baseline, with ablations and user studies supporting the approach. The method also generalizes to other models, suggesting it applies broadly to efficient, style-consistent T2I generation without additional training.
Abstract
We present a training-free style-aligned image generation method that leverages a scale-wise autoregressive model. While large-scale text-to-image (T2I) models, particularly diffusion-based methods, have demonstrated impressive generation quality, they often suffer from style misalignment across generated image sets and slow inference speeds, limiting their practical usability. To address these issues, we propose three key components: initial feature replacement to ensure consistent background appearance, pivotal feature interpolation to align object placement, and dynamic style injection, which reinforces style consistency using a schedule function. Unlike previous methods requiring fine-tuning or additional training, our approach maintains fast inference while preserving individual content details. Extensive experiments show that our method achieves generation quality comparable to competing approaches, significantly improves style alignment, and delivers inference speeds over six times faster than the fastest model.
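To make the three components concrete, the following is a toy sketch of how they might act on one target image's features at each scale step, given a reference image generated alongside it. It is not the authors' implementation. Everything here is an assumption for illustration: features are flat lists of floats, style injection is reduced to a plain blend toward the reference, the schedule is a linear decay, and the names `align_step`, `linear_schedule`, `pivot_step`, `pivot_weight`, and `max_style` are hypothetical.

```python
from typing import List

# Toy stand-in for a flattened feature map at one scale (assumption).
Feature = List[float]


def lerp(a: Feature, b: Feature, w: float) -> Feature:
    """Blend a toward b: w=0 returns a, w=1 returns b."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def linear_schedule(step: int, num_steps: int,
                    start: float = 1.0, end: float = 0.0) -> float:
    """Hypothetical schedule function: weight moves linearly from start to end."""
    if num_steps <= 1:
        return start
    t = step / (num_steps - 1)
    return start + (end - start) * t


def align_step(ref: Feature, tgt: Feature, step: int, num_steps: int,
               pivot_step: int = 1, pivot_weight: float = 0.5,
               max_style: float = 0.3) -> Feature:
    """Return the target feature after one style-alignment step.

    ref: feature of the reference image at this scale step
    tgt: feature of the image being aligned at this scale step
    """
    if step == 0:
        # Initial feature replacement: copy the reference's first-scale feature.
        return list(ref)
    if step == pivot_step:
        # Pivotal feature interpolation: mix target and reference at one chosen step.
        return lerp(tgt, ref, pivot_weight)
    # Dynamic style injection (simplified to blending): the strength
    # follows the schedule, so it fades out at later, finer scales.
    w = max_style * linear_schedule(step, num_steps)
    return lerp(tgt, ref, w)
```

With a reference feature of `[1.0, 1.0]` and a target of `[0.0, 0.0]` over four steps, step 0 copies the reference, step 1 lands halfway between the two, and the final step leaves the target unchanged because the assumed schedule has decayed to zero.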
