Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk

TL;DR

Switti addresses the speed-quality trade-off in text-to-image synthesis with a scale-wise transformer that drops the causal attention used by earlier next-scale autoregressive models, which makes sampling faster and reduces memory usage. It uses a tuned RQ-VAE tokenizer, dual text encoders, and targeted architectural changes to stabilize training and improve text alignment; disabling classifier-free guidance (CFG) at high-resolution scales further accelerates sampling and improves fine detail. In automated and human evaluations, Switti matches or exceeds autoregressive baselines and remains competitive with diffusion models while sampling up to 7x faster. The work also analyzes text conditioning across scales, showing that the model relies less on the text prompt at high resolutions, which motivates turning CFG off at those scales with minimal quality loss.

Abstract

This work presents Switti, a scale-wise transformer for text-to-image generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
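The idea of disabling guidance at high-resolution scales can be illustrated with a minimal sketch. It is not the authors' implementation: the `predict` callable, the `cfg_off_from` threshold, and the guidance formula (one common formulation of classifier-free guidance) are assumptions made for illustration. The sketch counts model calls to show where the saving comes from.

```python
import numpy as np

def guided_logits(cond, uncond, guidance_scale):
    # One common CFG formulation (an assumption, not taken from the paper):
    # push the conditional prediction away from the unconditional one.
    return uncond + guidance_scale * (cond - uncond)

def sample_scales(predict, num_scales, guidance_scale, cfg_off_from):
    """Run a scale-wise loop, applying CFG only below `cfg_off_from`.

    `predict(scale, conditional)` is a hypothetical model callable that
    returns logits for one scale. Returns per-scale logits and the number
    of model calls made.
    """
    outputs, calls = [], 0
    for scale in range(num_scales):
        cond = predict(scale, True)
        calls += 1
        if scale < cfg_off_from:
            # Low-resolution scales: pay for the extra unconditional pass.
            uncond = predict(scale, False)
            calls += 1
            outputs.append(guided_logits(cond, uncond, guidance_scale))
        else:
            # High-resolution scales: skip guidance and the second pass.
            outputs.append(cond)
    return outputs, calls
```

With four scales and guidance switched off from scale 2 onward, this loop makes six model calls instead of eight. Real implementations typically batch the conditional and unconditional passes together, so the saving appears as a halved batch rather than a skipped call.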

Paper Structure

This paper contains 45 sections, 2 equations, 22 figures, 8 tables.

Figures (22)

  • Figure 1: Switti produces high quality and aesthetic $1024{\times}1024$ image samples in around $0.5$ seconds.
  • Figure 2: Transformer block in the Switti model.
  • Figure 3: Last transformer block activation norms over training. Casting the prediction head to full-precision reduces the norm growth. "Sandwich"-normalization further mitigates the issue.
  • Figure 4: Evaluation of $d{=}20$ models on COCO 30K. Using the non-causal attention mask also slightly improves the performance.
  • Figure 5: Visualization of the block-wise self-attention masks in VAR (Left) and Switti (Right).
  • ...and 17 more figures
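
The block-wise masks mentioned in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: it assumes the causal (VAR-style) mask lets each token attend to every token at its own and all coarser scales, and that the non-causal counterpart restricts attention to tokens within the same scale. The function names and the toy scale sizes are hypothetical.

```python
import numpy as np

def scale_ids(token_counts):
    # Map each token position to the index of the scale it belongs to.
    return np.repeat(np.arange(len(token_counts)), token_counts)

def blockwise_causal_mask(token_counts):
    # True where a query (row) may attend to a key (column):
    # the key's scale is the same as or coarser than the query's scale.
    ids = scale_ids(token_counts)
    return ids[:, None] >= ids[None, :]

def blockwise_noncausal_mask(token_counts):
    # Assumed non-causal variant: attention only within the current scale,
    # which gives a block-diagonal mask.
    ids = scale_ids(token_counts)
    return ids[:, None] == ids[None, :]

# Hypothetical toy pyramid: 1x1, 2x2, and 3x3 token maps.
token_counts = [1, 4, 9]
causal = blockwise_causal_mask(token_counts)
noncausal = blockwise_noncausal_mask(token_counts)
```

For this toy pyramid of 14 tokens, the causal mask has 147 allowed query-key pairs and the block-diagonal mask has 98. Under the stated assumption, fewer attended positions, together with not needing to keep keys and values from earlier scales, is where faster sampling and lower memory usage would come from.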