A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, Shifeng Zhang, Sarah Parisot

TL;DR

The paper addresses the challenge of achieving reliable spatial control in transformer-based image generation. It evaluates both diffusion/flow (SiT) and Visual Autoregressive Modelling (VAR) on ImageNet, proposing a simple prefilling baseline with control tokens that works across paradigms. It shows that sampling-time refinements such as classifier-free guidance, control guidance, and softmax truncation can improve control-consistency and sometimes image quality, with careful tuning required. It also clarifies the role of adapters, showing they help preserve generation ability and reduce forgetting when data is limited, though they generally underperform fully-finetuned models on control fidelity. Overall, the work provides practical guidance for practitioners and researchers on how to balance control quality, visual realism, inference cost, and data availability in transformer-based spatially conditioned generation.

Abstract

Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via e.g. edge maps, poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate "forgetting" and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency.
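The prefilling idea can be illustrated with a small attention-mask sketch. It assumes a "block causal" mask means that each token may attend to every token in its own block and in all earlier blocks, with the control tokens forming the first block; the function name `block_causal_mask` and the block sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def block_causal_mask(block_sizes):
    """Boolean attention mask; True means the query (row) may attend to the key (column).

    Assumption: tokens are grouped into consecutive blocks, e.g. control
    tokens first, then generated tokens (one block per scale for a
    VAR-style model). A query sees its own block and all earlier blocks.
    """
    # Block index of every token position, e.g. [2, 3] -> [0, 0, 1, 1, 1].
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    # Query i may attend to key j iff key j's block is not later than query i's.
    return block_ids[:, None] >= block_ids[None, :]

# Two control tokens followed by three generated tokens.
mask = block_causal_mask([2, 3])
```

Under this mask the control tokens never attend to generated tokens, so their keys and values do not depend on what is generated and can be computed once and cached.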

Paper Structure

This paper contains 35 sections, 13 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Illustration of our prefill baseline for spatial control with transformers. a) The control conditioning is encoded to the generative model's latent space. b) Generative modelling is performed in the latent space using VAR (top) or SiT (bottom). Generative tokens attend back to conditioning control tokens using a block causal mask, allowing for KV-caching at inference. c) Generated tokens are decoded from the latent space to the pixel space.
  • Figure 2: Left: Examples of conditional generations using prefilling. Additional examples can be found in the appendix. Right: Results for our prefill baseline where each model is finetuned from an ImageNet-pretrained base model. "*" indicates results copied from ControlAR and ControlVAR. With default sampling parameters, prefilling is able to (a) generate higher quality images than recent transformer-based approaches with (b) comparable control consistency. We note the (c) relatively low overhead for SiT, as KV-caching means the control tokens use up only a single forward pass. (d) Quantising the control input with VQVAE hurts consistency for VAR, without affecting generation quality.
  • Figure 3: Visualisation of the effect of adjusting sampling parameters, CFG=3.0 (please zoom in for details). Left (ctrl-G): consistency to the canny map is visibly improved, but this may come at the cost of image quality, e.g. over-emphasised edges or incongruous shapes (note the treatment of what was a hexagonal metal wire mesh in the original control image). SiT suffers from saturation artefacts; applying projected guidance (proj-G) ameliorates this issue. Right (distribution truncation): Top-$p$ softmax truncation visibly improves VAR consistency without hurting generation quality. Temperature scaling the score has little visible effect on the generation.
  • Figure 4: The effect of CFG and Ctrl-G on conditional generation. Brighter means better. CFG generally improves generation quality according to FID$\downarrow$ and IS$\uparrow$, with a slight decrease in control consistency (F1$\uparrow$, RMSE$\downarrow$). Ctrl-G significantly improves consistency, but introduces a trade-off against generation quality.
  • Figure 5: Effect of CFG and distribution truncation on conditional generation. Brighter means better. Again, CFG generally improves generation quality, with a slight decrease in consistency. Top-$p$ softmax truncation improves both generation quality and consistency for VAR, although aggressive truncation may reduce diversity and hurt FID. Score temperature scaling does not produce any meaningful benefit for SiT.
  • ...and 9 more figures
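
The sampling-time adjustments named in the captions above (CFG, Ctrl-G and top-$p$ truncation) can be sketched roughly as follows. This excerpt does not give the paper's formulas, so both functions are assumptions: `guided_prediction` uses a common two-term composition of guidance (class guidance plus a separate control term), and `top_p_truncate` is standard nucleus truncation of a softmax distribution. All names and weights are hypothetical.

```python
import numpy as np

def guided_prediction(pred_uncond, pred_class, pred_full, w_cfg, w_ctrl):
    """Combine three model outputs with separate class and control weights.

    Assumed inputs (not from the paper): pred_uncond uses neither class nor
    control, pred_class uses the class only, pred_full uses class and control.
    With w_cfg = w_ctrl = 1 this reduces to pred_full.
    """
    return (pred_uncond
            + w_cfg * (pred_class - pred_uncond)
            + w_ctrl * (pred_full - pred_class))

def top_p_truncate(logits, p):
    """Keep the smallest set of most likely tokens whose mass reaches p, then renormalise."""
    shifted = logits - np.max(logits)            # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(-probs)                   # token indices, most likely first
    sorted_probs = probs[order]
    mass_before = np.cumsum(sorted_probs) - sorted_probs
    keep = mass_before < p                       # the top token is always kept
    truncated = np.zeros_like(probs)
    truncated[order[keep]] = sorted_probs[keep]
    return truncated / truncated.sum()

# Example: truncate a four-token distribution at p = 0.7.
probs = top_p_truncate(np.log(np.array([0.5, 0.3, 0.15, 0.05])), p=0.7)
```

Raising `w_ctrl` above 1 pushes the output further in the direction that the control input adds, which matches the consistency versus quality trade-off the captions describe. Lowering `p` removes more of the low-probability tail before sampling.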