Controllable Image Generation With Composed Parallel Token Prediction

Jamie Stirling, Noura Al-Moubayed

TL;DR

This work introduces a principled framework for controllable image generation by composing the log-probabilities of discrete latent-space models. It derives exact formulas for conjunction, negation, and weighted conditioning, and shows how to apply them to parallel token prediction within VQ-VAE/VQ-GAN pipelines. Across FFHQ, Positional CLEVR, and Relational CLEVR, the method achieves state-of-the-art compositional accuracy while maintaining competitive FID and delivering substantial speedups over continuous methods. The approach supports out-of-distribution generalisation and can be applied to open pre-trained models without fine-tuning, highlighting its practical impact for interpretable, efficient controllable generation. Limitations include independence assumptions and increased compute with more conditions, but the results demonstrate strong generalisation, controllability, and potential for broader applicability in discrete generative tasks.

Abstract

Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training (compositional generalisation). Despite recent progress in compositional image generation via composing continuous sampling processes such as diffusion and energy-based models, composing discrete generative processes has remained an open challenge, with the promise of providing improvements in efficiency, interpretability and simplicity. To this end, we propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space. Our approach, when applied alongside VQ-VAE and VQ-GAN, achieves state-of-the-art generation accuracy in three distinct settings (FFHQ, Positional CLEVR and Relational CLEVR) while attaining competitive Fréchet Inception Distance (FID) scores. Our method attains an average generation accuracy of $80.71\%$ across the studied settings. Our method also outperforms the next-best approach (ranked by accuracy) in terms of FID in seven out of nine experiments, with an average FID of $24.23$ (an average improvement of $-9.58$). Furthermore, our method offers a $2.3\times$ to $12\times$ speedup over comparable continuous compositional methods on our hardware. We find that our method can generalise to combinations of input conditions that lie outside the training data (e.g. more objects per image) in addition to offering an interpretable dimension of controllability via concept weighting. We further demonstrate that our approach can be readily applied to an open pre-trained discrete text-to-image model without any fine-tuning, allowing for fine-grained control of text-to-image generation.
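The composition of log-probabilities described above can be illustrated with a short derivation. This is a sketch under stated assumptions, not a reproduction of the paper's own equations: $x$ denotes a discrete latent token, $c_1, \dots, c_n$ the input conditions, and $w_i$ a per-condition weight (notation chosen here for illustration). Assuming the conditions are independent given $x$, Bayes' rule gives

$$
p(x \mid c_1, \dots, c_n) \;\propto\; p(x) \prod_{i=1}^{n} p(c_i \mid x) \;\propto\; p(x) \prod_{i=1}^{n} \frac{p(x \mid c_i)}{p(x)} .
$$

Taking logarithms and introducing the weights $w_i$ yields

$$
\log \tilde{p}(x \mid c_1, \dots, c_n) \;=\; \log p(x) + \sum_{i=1}^{n} w_i \left[ \log p(x \mid c_i) - \log p(x) \right] + \text{const},
$$

where the constant is fixed by renormalising over the token vocabulary. Under this reading, $w_i = 1$ for every $i$ corresponds to plain conjunction, $w_i > 1$ strengthens condition $c_i$, and $w_i < 0$ steers generation away from $c_i$ (negation).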

Paper Structure

This paper contains 38 sections, 14 equations, 15 figures, and 4 tables.

Figures (15)

  • Figure 1: Overview of our approach. At each generation step, unconditional and conditional unmasking probabilities are obtained, conditioned on the unmasked state and input attributes. Next, our discrete compositional framework is applied, before sampling from the resulting distribution and unmasking a random selection of tokens. This is repeated until a fully unmasked representation of an image is obtained, which is finally decoded into an image.
  • Figure 2: Scatter plots of compositional generation error vs FID on 3 datasets (3 input components): Our method lies on the Pareto front of all results (see the appendix for full scatter plots) while achieving the lowest or joint-lowest error among the baselines.
  • Figure 3: Compositional text-to-image results with captions (zooming recommended). Our framework allows multiple prompts to be composed for fine-grained control of outputs, with minimal extra memory requirements.
  • Figure 4: Concept negation with text-to-image: Our method allows more precise control over the outputs of an existing pre-trained model (aMUSEd; Patil et al., 2024). Images are in pairs with the baseline result on the left and our method on the right.
  • Figure 5: Compositional out-of-distribution generation: Positional CLEVR training images contain no more than 5 objects per image, but our compositional method allows 6 or more objects to appear in the same image via compositional sampling.
  • ...and 10 more figures
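
The generation loop summarised in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' implementation: `MASK`, `toy_predict_logits`, `compose_log_probs`, and `generate` are hypothetical names, the composition rule is the weighted log-probability sum that follows from a conditional-independence assumption, the unmasking schedule (an even split over the remaining steps) is a placeholder, and the final VQ-VAE/VQ-GAN decoding step is omitted.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a still-masked token position


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def compose_log_probs(uncond_logp, cond_logps, weights):
    """Combine per-token log-probabilities from several conditions.

    Assumed rule: log p(x) + sum_i w_i * (log p(x | c_i) - log p(x)),
    renormalised per token. A positive w_i favours condition i,
    a negative w_i steers away from it.
    """
    composed = uncond_logp.copy()
    for logp, w in zip(cond_logps, weights):
        composed = composed + w * (logp - uncond_logp)
    return log_softmax(composed)


def generate(predict_logits, conditions, weights, num_tokens, steps, rng):
    """Iteratively unmask a token sequence using composed predictions.

    predict_logits(tokens, condition) must return an array of shape
    (num_tokens, vocab_size); condition=None means unconditional.
    """
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        uncond = log_softmax(predict_logits(tokens, None))
        conds = [log_softmax(predict_logits(tokens, c)) for c in conditions]
        probs = np.exp(compose_log_probs(uncond, conds, weights))
        # Placeholder schedule: split the remaining masked tokens evenly
        # across the remaining steps, so the last step unmasks everything.
        k = int(np.ceil(masked.size / (steps - step)))
        chosen = rng.choice(masked, size=k, replace=False)
        for pos in chosen:
            tokens[pos] = rng.choice(probs.shape[-1], p=probs[pos])
    return tokens


def toy_predict_logits(tokens, condition):
    """Stand-in for a trained token predictor (vocabulary of 4 tokens).

    The unconditional output is uniform; a condition is an integer token
    index whose logit is strongly boosted at every position.
    """
    logits = np.zeros((tokens.size, 4))
    if condition is not None:
        logits[:, condition] = 50.0
    return logits


rng = np.random.default_rng(0)
sample = generate(toy_predict_logits, conditions=[2], weights=[1.0],
                  num_tokens=16, steps=4, rng=rng)
print(sample)
```

In a real pipeline the toy predictor would be replaced by a trained parallel token predictor, and the returned token grid would be passed to the decoder to obtain an image. Note that each extra condition adds one forward pass per step, which matches the increased compute with more conditions noted in the TL;DR.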