A Practical Investigation of Spatially-Controlled Image Generation with Transformers
Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, Shifeng Zhang, Sarah Parisot
TL;DR
The paper addresses the challenge of achieving reliable spatial control in transformer-based image generation. It evaluates both a diffusion/flow model (SiT) and Visual Autoregressive Modelling (VAR) on ImageNet, and proposes control token prefilling as a simple baseline that works across both paradigms. It shows that sampling-time refinements such as classifier-free guidance, control guidance, and softmax truncation can improve control consistency and sometimes image quality, provided they are carefully tuned. It also clarifies the role of adapters: they help preserve generation ability and reduce forgetting when data is limited, though they generally underperform fully finetuned models on control fidelity. Overall, the work offers practical guidance for practitioners and researchers on balancing control quality, visual realism, inference cost, and data availability in transformer-based spatially conditioned generation.
Abstract
Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via, e.g., edge maps or poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling-time enhancements, showing that extending classifier-free guidance to control and applying softmax truncation both have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate "forgetting" and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency.
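To make the two sampling-time enhancements named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes one common way of extending classifier-free guidance to a second condition (separate scales for the class and the control signal, applied to three model outputs), and it interprets softmax truncation as top-k filtering of autoregressive token logits. The function names, parameter names, and the exact guidance formula are hypothetical choices made for illustration.

```python
import math


def guided_output(uncond, class_only, class_and_control, w_class, w_control):
    """Combine three model outputs using separate class and control scales.

    Assumed formulation (not taken from the paper):
        out = uncond
              + w_class   * (class_only - uncond)
              + w_control * (class_and_control - class_only)
    With w_class == w_control == 1 this reduces to class_and_control.
    """
    return [
        u + w_class * (c - u) + w_control * (cc - c)
        for u, c, cc in zip(uncond, class_only, class_and_control)
    ]


def truncated_softmax(logits, top_k):
    """Softmax restricted to the top_k largest logits (assumed top-k truncation).

    Tokens outside the top_k set receive probability 0.0.
    """
    if not 1 <= top_k <= len(logits):
        raise ValueError("top_k must be between 1 and len(logits)")
    # Indices of the top_k largest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Subtract the maximum kept logit for numerical stability.
    max_logit = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - max_logit) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


if __name__ == "__main__":
    logits = guided_output(
        uncond=[0.0, 0.0, 0.0],
        class_only=[1.0, 2.0, 0.5],
        class_and_control=[2.0, 2.0, 0.0],
        w_class=1.5,
        w_control=2.0,
    )
    print(truncated_softmax(logits, top_k=2))
```

In this sketch, raising `w_control` pushes the output further in the direction that the control signal adds on top of the class condition, while lowering `top_k` removes low-probability tokens before sampling. Both knobs would need tuning in practice, in line with the abstract's point that these choices strongly affect control-generation consistency.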
