Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk
TL;DR
Switti addresses the trade-off between sampling speed and image quality in text-to-image synthesis. It builds on scale-wise autoregressive transformers and removes explicit causality, yielding a non-causal generator that samples faster and uses less memory. It relies on a tuned RQ-VAE tokenizer, dual text encoders, and targeted architectural changes to stabilize training and improve text alignment. Disabling classifier-free guidance (CFG) at high-resolution scales further accelerates sampling and improves fine-grained detail. In automated and human evaluations, Switti matches or exceeds autoregressive baselines and remains competitive with diffusion models while sampling up to 7x faster. The work also analyzes text conditioning across scales, finding that the model relies less on text at high resolutions, which motivates switching CFG off at those scales with minimal quality loss.
Abstract
This work presents Switti, a scale-wise transformer for text-to-image (T2I) generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart that enables ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we show that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
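
To make the guidance schedule concrete, the following is a minimal sketch of scale-wise sampling in which classifier-free guidance is applied only at the coarse scales. It is not the authors' implementation. The names `sample_scale_wise`, `guided_logits`, and `toy_predict_logits`, the scale sizes, the guidance scale, and the cutoff index `cfg_off_from` are all hypothetical, and the toy predictor returns random logits instead of conditioning on earlier scales as a real model would. The guidance rule shown is one common formulation of classifier-free guidance.

```python
import numpy as np

def guided_logits(cond, uncond, guidance_scale):
    # One common CFG formulation: extrapolate from the unconditional
    # prediction toward the text-conditional one.
    return uncond + guidance_scale * (cond - uncond)

def sample_scale_wise(predict_logits, scale_sides, guidance_scale, cfg_off_from, seed=0):
    """Sample one token map per scale, using CFG only for scales < cfg_off_from.

    predict_logits(scale_index, side, conditional) is a hypothetical callable
    returning an array of shape (side * side, vocab_size).
    Returns (list of token maps, number of model forward passes).
    """
    rng = np.random.default_rng(seed)
    token_maps, passes = [], 0
    for i, side in enumerate(scale_sides):
        cond = predict_logits(i, side, True)
        passes += 1
        if i < cfg_off_from:
            # Guidance costs a second (unconditional) forward pass.
            uncond = predict_logits(i, side, False)
            passes += 1
            logits = guided_logits(cond, uncond, guidance_scale)
        else:
            # Guidance disabled: the conditional prediction is used directly.
            logits = cond
        # Numerically stable softmax over the vocabulary axis.
        z = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(z)
        probs /= probs.sum(axis=-1, keepdims=True)
        tokens = np.array([rng.choice(probs.shape[-1], p=p) for p in probs])
        token_maps.append(tokens.reshape(side, side))
    return token_maps, passes

def toy_predict_logits(scale_index, side, conditional, vocab_size=16):
    # Stand-in for a transformer: deterministic random logits per scale.
    rng = np.random.default_rng(1000 * scale_index + int(conditional))
    return rng.normal(size=(side * side, vocab_size))

scale_sides = [1, 2, 4, 8]  # hypothetical token-map side lengths
_, passes_full = sample_scale_wise(toy_predict_logits, scale_sides, 4.0, len(scale_sides))
_, passes_partial = sample_scale_wise(toy_predict_logits, scale_sides, 4.0, 2)
print(passes_full, passes_partial)
```

In this sketch, each scale where guidance is switched off saves one forward pass. Because the highest-resolution scales contain the most tokens, skipping the unconditional pass there removes the most expensive part of the guidance cost, which is the intuition behind the speedup described in the abstract.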
