Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind

TL;DR

Kaleido Diffusion tackles the loss of sample diversity in diffusion-based image generation, a gap that widens under strong classifier-free guidance (CFG). By introducing autoregressive latent priors that produce discrete latent tokens (textual descriptions, bounding boxes, object blobs, or visual tokens) from the caption, and by jointly training a latent-augmented diffusion model, Kaleido enriches the conditioning without sacrificing image quality. The method enables interpretable, stepwise latent editing and fine-grained control over outputs, demonstrated through quantitative gains in recall and stable FID across CFG scales, plus qualitative diversity and controllability. This broadens the practical applicability of text-to-image diffusion by supporting more diverse and customizable visual interpretations of a given prompt.
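
One way to read the summary above is as a latent-variable factorization of the caption-to-image distribution. The notation below is an assumption made for illustration (c for the caption, z for the discrete latent tokens, x for the image, θ and φ for the parameters of the prior and the diffusion model); this excerpt does not give the paper's own symbols.

```latex
% Sketch under assumed notation: the image distribution marginalizes over
% discrete latents z drawn from an autoregressive prior over T tokens.
\begin{align}
  p(x \mid c) &= \sum_{z} p_\theta(z \mid c)\, p_\phi(x \mid z, c), \\
  p_\theta(z \mid c) &= \prod_{t=1}^{T} p_\theta(z_t \mid z_{<t}, c).
\end{align}
```

Under this reading, diversity can enter through the sampled z even when the diffusion step p_φ(x | z, c) is run with a high guidance weight.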

Abstract

Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, serving as abstract and intermediary representations for guiding and facilitating the image generation process. In this paper, we explore a variety of discrete latent representations, including textual descriptions, detection bounding boxes, object blobs, and visual tokens. These representations diversify and enrich the input conditions to the diffusion models, enabling more diverse outputs. Our experimental results demonstrate that Kaleido effectively broadens the diversity of the generated image samples from a given textual description while maintaining high image quality. Furthermore, we show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
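
The two-stage idea in the abstract can be sketched as follows. This is a toy illustration, not the authors' implementation: the vocabulary, the uniform token draw, and the names `sample_latents`, `build_condition`, and `cfg_combine` are all hypothetical, and the guidance formula uses one common classifier-free guidance convention (weight 1 recovers the conditional prediction), which may differ from the paper's.

```python
import random

# Hypothetical toy vocabulary of discrete latent tokens (not from the paper).
VOCAB = ["<tabby>", "<black>", "<indoor>", "<outdoor>", "<close-up>", "<eos>"]


def sample_latents(caption, rng, max_len=4):
    """Toy autoregressive prior: draw tokens one at a time until <eos>.

    A real prior would condition each draw on the caption and the tokens
    generated so far via a language model; here each draw is uniform,
    purely for illustration, so `caption` is unused.
    """
    tokens = []
    for _ in range(max_len):
        tok = rng.choice(VOCAB)
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens


def build_condition(caption, latents):
    """Augment the caption with the sampled latent tokens."""
    return " ".join([caption, *latents])


def cfg_combine(eps_cond, eps_uncond, weight):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one by `weight`."""
    return [u + weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    caption = "a cat sat on the mat"
    # Each run of the prior yields a different augmented condition, so the
    # (omitted) diffusion model sees varied inputs for the same caption.
    for _ in range(3):
        print(build_condition(caption, sample_latents(caption, rng)))
```

In this sketch the variation comes entirely from the prior: the guided diffusion step would then be applied to each augmented condition separately.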

Paper Structure

This paper contains 40 sections, 10 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Comparison of the generated image samples given the caption "a cat sat on the mat". Our models generate more diverse images with the help of autoregressive latent modeling.
  • Figure 2: Training pipeline of the proposed Kaleido diffusion.
  • Figure 3: Effect of augmented latents. The first row displays the sampling results from the standard diffusion model, while the second row shows the results from the latent-augmented diffusion models.
  • Figure 4: A Variety of Discrete Tokens. Original caption: "Dog laying on a human's lap"
  • Figure 5: Comparison with guidance weights.
  • ...and 9 more figures