A Training-Free Style-aligned Image Generation with Scale-wise Autoregressive Model

Jihun Park, Jongmin Gim, Kyoungmin Lee, Minseok Oh, Minwoo Choi, Jaeyeul Kim, Woo Chool Park, Sunghoon Im

TL;DR

The paper tackles style misalignment and slow inference in large-scale text-to-image (T2I) generation with a training-free framework built on a scale-wise autoregressive model. It analyzes next-scale prediction to identify the steps at which RGB statistics, object placement, and style are established, then applies three interventions (initial feature replacement, pivotal feature interpolation, and dynamic style injection) to align style across a set of images while preserving each image's content. Experiments show generation quality comparable to competing approaches, improved style consistency, and inference more than six times faster than the fastest competing model; ablations and user studies support these findings. The method is also reported to generalize to other models, without additional training.

Abstract

We present a training-free style-aligned image generation method that leverages a scale-wise autoregressive model. While large-scale text-to-image (T2I) models, particularly diffusion-based methods, have demonstrated impressive generation quality, they often suffer from style misalignment across generated image sets and slow inference speeds, limiting their practical usability. To address these issues, we propose three key components: initial feature replacement to ensure consistent background appearance, pivotal feature interpolation to align object placement, and dynamic style injection, which reinforces style consistency using a schedule function. Unlike previous methods requiring fine-tuning or additional training, our approach maintains fast inference while preserving individual content details. Extensive experiments show that our method achieves generation quality comparable to competing approaches, significantly improves style alignment, and delivers inference speeds over six times faster than the fastest model.

Paper Structure

This paper contains 37 sections, 8 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Comparison between (a) Standard Text-to-Image model (style misaligned) and (b) Ours (style aligned). The top rows use the text prompts "A {Cat, Rose, Dragon, Robot, Santaclaus}" and the bottom rows use "A {Map, Dolphin, Mushroom, Backpack, Saxophone}".
  • Figure 2: Inference time ($\downarrow$, lower is better) vs. dual consistency ($\uparrow$, higher is better) curve comparing ours with competitive methods: StyleAligned [hertz2024stylealignedimagegeneration], B-LoRA [frenkel2024implicitstylecontentseparationusing], StyleDrop [sohn2023styledrop], DreamBooth-LoRA (DB-LoRA) [ryu2023low], IP-Adapter [ye2023ipadaptertextcompatibleimage], AlignedGen [zhang2025alignedgen], CSGO [xing2024csgo], and DreamO [mou2025dreamo].
  • Figure 3: Visualization of images generated at different steps of the next-scale prediction process. In the early- and mid-stages, global composition and overall style are established, while later steps focus on refining details and textures. We also track RGB statistics, content similarity, and style similarity across 400 generated images, comparing each step to the final output (12th step) to evaluate the progression of content preservation, style consistency, and RGB statistics.
  • Figure 4: Overall pipeline of our model. The T5 text encoder processes text prompts $\mathbf{T}$, providing conditions and $\langle \text{SOS} \rangle$ tokens to the transformer. Initial Feature Replacement aligns RGB statistics at the 1st and 2nd generation steps. Pivotal Feature Interpolation adjusts object positions and styles at the $\bar{s}$-th step ($\bar{s}=3$), while Dynamic Style Injection gradually reduces style influence from the 3rd to 7th steps. The decoder converts the transformer’s final step output $\mathbf{F}_{S}$ into the style-aligned images $\mathbf{I}$.
  • Figure 5: Qualitative comparison with state-of-the-art style-aligned image generation models.
  • ...and 10 more figures
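
The step assignments in the Figure 4 caption can be laid out as a small schedule. The sketch below is illustrative only and is not the authors' implementation. The step numbers (1 and 2, pivot step 3, injection over steps 3 to 7, 12 steps in total) come from the captions of Figures 3 and 4. The linear decay in `style_weight`, the blending rule in `blend`, and all function and constant names are assumptions made for this example.

```python
import numpy as np

NUM_STEPS = 12                    # Figure 3 treats the 12th step as the final output
REPLACE_STEPS = (1, 2)            # initial feature replacement (Figure 4)
PIVOT_STEP = 3                    # \bar{s} = 3 (Figure 4)
INJECT_START, INJECT_END = 3, 7   # dynamic style injection range (Figure 4)


def style_weight(step, w_max=1.0):
    """Hypothetical schedule: style influence decays linearly over steps 3..7.

    The caption only says the influence is gradually reduced; the linear
    shape and the value of w_max are assumptions.
    """
    if step < INJECT_START or step > INJECT_END:
        return 0.0
    span = INJECT_END - INJECT_START + 1
    return w_max * (1.0 - (step - INJECT_START) / span)


def interventions(step):
    """Return the names of the interventions active at a 1-indexed step."""
    active = []
    if step in REPLACE_STEPS:
        active.append("initial_feature_replacement")
    if step == PIVOT_STEP:
        active.append("pivotal_feature_interpolation")
    if INJECT_START <= step <= INJECT_END:
        active.append("dynamic_style_injection")
    return active


def blend(reference, target, weight):
    """Assumed blending rule: convex combination of two feature arrays."""
    return weight * reference + (1.0 - weight) * target


if __name__ == "__main__":
    for step in range(1, NUM_STEPS + 1):
        print(step, interventions(step), round(style_weight(step), 2))
```

Running the module prints, for each of the 12 steps, which interventions are active and the assumed style weight; steps 8 to 12 have no intervention, matching the caption's description of later steps as refinement only.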