Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

Dongmin Park, Sebin Kim, Taehong Moon, Minkyu Kim, Kangwook Lee, Jaewoong Cho

TL;DR

The paper tackles the challenge of generating rare concept compositions with text-to-image diffusion models. It introduces R2F, a training-free framework that uses LLM guidance to map rare concepts to more frequent, easier-to-ground concepts and to alternate between the resulting prompts during diffusion sampling. A theoretical analysis based on Gaussian score interpolation justifies why exposing frequent concepts helps when data for rare concepts is limited. Experiments on RareBench and other benchmarks show improvements of up to 28.1 percentage points in T2I alignment, and the approach extends to region-guided generation via R2F+.

Abstract

State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with region-guided diffusion approaches. In extensive experiments on three datasets, including our newly proposed benchmark, RareBench, which contains various prompts with rare compositions of concepts, R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1%p in T2I alignment. Code is available at https://github.com/krafton-ai/Rare-to-Frequent.
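
The rare-to-frequent guidance described above can be pictured as a per-step prompt schedule: a frequent prompt is exposed during the early sampling steps, and the sampler then returns to the original rare prompt. The sketch below is only an illustration under stated assumptions. The function name `rare_to_frequent_schedule`, the `stop_step` parameter, and the even/odd alternation pattern are hypothetical and are not taken from the released code. In the actual framework the frequent concept would be proposed by an LLM, whereas here it is hard-coded.

```python
from typing import List


def rare_to_frequent_schedule(
    rare_prompt: str,
    frequent_prompt: str,
    num_steps: int,
    stop_step: int,
    alternate: bool = True,
) -> List[str]:
    """Return the prompt to condition on at each sampling step.

    Hypothetical helper: before `stop_step` the frequent prompt is exposed
    (on even steps only if `alternate` is True); from `stop_step` onward
    only the original rare prompt is used.
    """
    if not 0 <= stop_step <= num_steps:
        raise ValueError("stop_step must lie in [0, num_steps]")

    schedule = []
    for step in range(num_steps):
        if step < stop_step:
            use_frequent = (step % 2 == 0) if alternate else True
            schedule.append(frequent_prompt if use_frequent else rare_prompt)
        else:
            schedule.append(rare_prompt)
    return schedule


# Example prompts assumed for illustration (rare: furry frog, frequent: furry dog).
schedule = rare_to_frequent_schedule(
    "a furry frog", "a furry dog", num_steps=6, stop_step=4
)
for step, prompt in enumerate(schedule):
    print(step, prompt)
```

A sampler would then encode `schedule[step]` as the text condition at each denoising step. How late to stop exposing the frequent prompt is a design choice that this sketch leaves as a free parameter.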

Paper Structure

This paper contains 33 sections, 3 theorems, 11 equations, 20 figures, 18 tables, 1 algorithm.

Key Result

Theorem 3.1

Given the above setting, consider the linear interpolated score estimator for the rare concept, i.e., a combination with weight $\alpha$ of the estimated score functions for the rare and the frequent concept. This interpolated score function corresponds to the score function of the Gaussian distribution ${\mathcal{N}}({{\bm{\mu}}}_{\text{lerp}},{{\bm{\Sigma}}}_{\text{lerp}})$ where ${\bm{\mu}}_{\text{lerp}}=\alpha{\bm{\mu}}_R+(1-\alpha){\bm{\mu}}_F$, and ${\bm{\Sigma}}_{\text{lerp}}^{-1}=\alpha\,\ldots$
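
As a reading aid, the following short derivation shows why a linear interpolation of two Gaussian score functions is again a Gaussian score. The notation $s_R$, $s_F$, ${\bm{\Sigma}}_R$, ${\bm{\Sigma}}_F$ is assumed here for illustration and stands for whichever estimated scores and covariances the theorem's setting uses. It is a sketch of the general Gaussian case, not a reproduction of the paper's proof.

```latex
% Assumed notation: s_R and s_F are the scores of N(mu_R, Sigma_R) and N(mu_F, Sigma_F).
\begin{align*}
s_R({\bm{x}}) &= -{\bm{\Sigma}}_R^{-1}({\bm{x}}-{\bm{\mu}}_R), \qquad
s_F({\bm{x}}) = -{\bm{\Sigma}}_F^{-1}({\bm{x}}-{\bm{\mu}}_F), \\
\alpha\, s_R({\bm{x}}) + (1-\alpha)\, s_F({\bm{x}})
&= -\bigl[\alpha{\bm{\Sigma}}_R^{-1} + (1-\alpha){\bm{\Sigma}}_F^{-1}\bigr]{\bm{x}}
   + \alpha{\bm{\Sigma}}_R^{-1}{\bm{\mu}}_R + (1-\alpha){\bm{\Sigma}}_F^{-1}{\bm{\mu}}_F \\
&= -{\bm{\Sigma}}_{\text{lerp}}^{-1}\bigl({\bm{x}} - {\bm{\mu}}_{\text{lerp}}\bigr),
\end{align*}
where
\begin{align*}
{\bm{\Sigma}}_{\text{lerp}}^{-1} &= \alpha{\bm{\Sigma}}_R^{-1} + (1-\alpha){\bm{\Sigma}}_F^{-1}, \\
{\bm{\mu}}_{\text{lerp}} &= {\bm{\Sigma}}_{\text{lerp}}
   \bigl[\alpha{\bm{\Sigma}}_R^{-1}{\bm{\mu}}_R + (1-\alpha){\bm{\Sigma}}_F^{-1}{\bm{\mu}}_F\bigr].
\end{align*}
% Gap between the general mean and the plain linear combination of the means:
\begin{equation*}
{\bm{\mu}}_{\text{lerp}} - \bigl[\alpha{\bm{\mu}}_R + (1-\alpha){\bm{\mu}}_F\bigr]
= \alpha(1-\alpha)\,{\bm{\Sigma}}_{\text{lerp}}
  \bigl({\bm{\Sigma}}_R^{-1} - {\bm{\Sigma}}_F^{-1}\bigr)\bigl({\bm{\mu}}_R - {\bm{\mu}}_F\bigr).
\end{equation*}
```

The last line indicates that the mean reduces to $\alpha{\bm{\mu}}_R+(1-\alpha){\bm{\mu}}_F$, as stated in the theorem, whenever $({\bm{\Sigma}}_R^{-1}-{\bm{\Sigma}}_F^{-1})({\bm{\mu}}_R-{\bm{\mu}}_F)=\bm{0}$. This holds, for instance, when the two covariances agree along the direction in which the means differ, which is the case for the example values quoted in the Figure 3 caption.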

Figures (20)

  • Figure 1: Generated images from prompts with rare compositions of concepts (= attribute + object; highlighted in red). These objects possess attributes not typically associated with them, making such combinations difficult to observe. While state-of-the-art pre-trained and LLM-grounded text-to-image diffusion models, SD3.0 (Esser et al., 2024), FLUX, and RPG (Yang et al., 2024), struggle to generate such concepts, our training-free approach, R2F, exhibits superior results.
  • Figure 2: (a) Image generation quality on a rare composition of two concepts: "flower-patterned" and an "animal" class (randomly sampled from ImageNet classes). Naive inference with SD3.0 (red line) tends to become inaccurate as the composition becomes rarer (i.e., as the animal class appears more rarely in the LAION dataset). Interestingly, once we guide the inference with a relatively frequent composition ("flower-patterned bear", which is easily generated as a "bear doll") at the early sampling steps and then switch back to the original prompt, the generation quality is significantly enhanced (blue line). (b) The key idea of our framework with LLM guidance.
  • Figure 3: Visualizing distributions in rare concept generation. (a) The true data distribution conditioned on the rare concept ${\bm{c}}_R$ (e.g., ("furry", "frog")), modeled as ${\mathcal{N}}((0,0), {\bm{I}}_2)$; (b) Initial estimated distribution for the rare concept ${\bm{c}}_R$, ${\mathcal{N}}((0,0), \text{diag}(20^2, 1))$, with high uncertainty along $x_1$ (red). The green points represent the estimated distribution for the frequent concept ${\bm{c}}_F$ (e.g., ("furry", "dog")), ${\mathcal{N}}((0,10), {\bm{I}}_2)$; (c) The distribution generated via linear interpolation of the score functions, $p_{\text{lerp}}({\bm{x}}|{\bm{c}}_R; \alpha = 0.8)$, which combines information from both the rare and frequent concepts, yielding a better approximation of the rare concept; (d) 2-Wasserstein distance between $p_{\text{lerp}}({\bm{x}}|{\bm{c}}_R; \alpha)$ and the target distribution (blue line). The distance shows that a well-chosen $\alpha$ improves the approximation compared to using only the rare concept score function (red dashed line).
  • Figure 4: Overview of our R2F framework.
  • Figure 4: Performance of R2F combined with different diffusion models (SDXL, IterComp, and SD3.0) on RareBench.
  • ...and 15 more figures
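
The Gaussian example in the Figure 3 caption can be checked numerically. The sketch below plugs the caption's values into the standard closed forms for an interpolated Gaussian score and for the 2-Wasserstein distance between Gaussians with diagonal covariances. The helper names `lerp_gaussian` and `w2_diag` are hypothetical, and the numbers it produces are not necessarily the exact quantities plotted in panel (d), which may be computed differently in the paper.

```python
import numpy as np


def lerp_gaussian(mu_r, cov_r, mu_f, cov_f, alpha):
    """Gaussian whose score is alpha * score_R + (1 - alpha) * score_F."""
    prec_r = np.linalg.inv(cov_r)
    prec_f = np.linalg.inv(cov_f)
    prec = alpha * prec_r + (1.0 - alpha) * prec_f  # interpolated precision
    cov = np.linalg.inv(prec)
    mu = cov @ (alpha * prec_r @ mu_r + (1.0 - alpha) * prec_f @ mu_f)
    return mu, cov


def w2_diag(mu_a, cov_a, mu_b, cov_b):
    """2-Wasserstein distance between Gaussians with diagonal covariances."""
    std_a = np.sqrt(np.diag(cov_a))
    std_b = np.sqrt(np.diag(cov_b))
    return float(np.sqrt(np.sum((mu_a - mu_b) ** 2) + np.sum((std_a - std_b) ** 2)))


# Values quoted in the Figure 3 caption.
mu_true, cov_true = np.zeros(2), np.eye(2)             # (a) true rare distribution
mu_r, cov_r = np.zeros(2), np.diag([20.0**2, 1.0])     # (b) noisy rare estimate
mu_f, cov_f = np.array([0.0, 10.0]), np.eye(2)         # (b) frequent concept

mu_lerp, cov_lerp = lerp_gaussian(mu_r, cov_r, mu_f, cov_f, alpha=0.8)
w2_rare = w2_diag(mu_r, cov_r, mu_true, cov_true)
w2_lerp = w2_diag(mu_lerp, cov_lerp, mu_true, cov_true)

print("interpolated mean:", mu_lerp)
print("W2(rare estimate, target):", w2_rare)
print("W2(interpolated, target):", w2_lerp)
```

With these values the interpolated mean equals $\alpha{\bm{\mu}}_R+(1-\alpha){\bm{\mu}}_F=(0,2)$, and the interpolated distribution lies much closer to the target than the rare-only estimate. This matches the qualitative message of panels (c) and (d).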

Theorems & Definitions (5)

  • Theorem 3.1: Improved rare concept generation via linear interpolation between score functions
  • proof
  • Theorem A.1: Improved rare concept generation via linear interpolation between score functions
  • Theorem A.1: Improved rare concept generation via linear interpolation between score functions
  • proof