GLoD: Composing Global Contexts and Local Details in Image Generation

Moyuru Yamada

TL;DR

GLoD addresses the challenge of jointly controlling global contexts and local details in text-to-image diffusion by decomposing prompts into global and local components and composing their noises across layered prompts. It operates without training or fine-tuning, using global guidance and region-aware local guidance to steer a pre-trained diffusion model, along with backward guidance for layout. The approach enables both global-global and global-local compositions, preserves unspecified identities, and supports layered editing in a single inference. Quantitative and qualitative evaluations show improved alignment with global interactions and local attributes, validating the method's effectiveness for complex scene synthesis and editing with minimal overhead and broad applicability to pre-trained models. A potential limitation is partial attribute transfer when object latent representations diverge significantly between global and local prompts.

Abstract

Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) remains a significant challenge. The models often fail to understand complex descriptions involving multiple objects, and they either apply specified visual attributes to the wrong targets or ignore them. This paper presents Global-Local Diffusion (GLoD), a novel framework that allows simultaneous control over the global contexts and the local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes their noises to guide a denoising process using pre-trained diffusion models. Our framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other unspecified identities. Our quantitative and qualitative evaluations demonstrate that GLoD effectively generates complex images that adhere to both user-provided object interactions and object details.

Paper Structure

This paper contains 19 sections, 6 equations, 11 figures, 1 table, 1 algorithm.

Figures (11)

  • Figure 1: Global-Local Diffusion (GLoD) takes multiple prompts as an input (e.g., a global prompt: 'a man is talking with a woman' and two local prompts: 'a man with white beard' and 'a woman is wearing a necklace and smiling') along with their layout and assigns noises obtained from them into corresponding layers with a pre-trained diffusion model. Then, the noises are effectively composed to generate an image. Details of objects in the global prompt are guided with the corresponding local prompts.
  • Figure 2: GLoD enables controlling global contexts (interaction between a dog and a man, their layouts) and local details (the dog is black, the man is wearing a blue shirt) independently. Local details can be specified (black dog $\rightarrow$ Husky dog) while preserving the global contexts. Note that this is not image editing. We generate images from the text prompts and the layout.
  • Figure 3: GLoD composes multiple layers. Unconditional noise and noises conditioned on global contexts (e.g., interactions) or local details (e.g., color) are assigned to separate layers ($l_0$, $l_1$, $l_2$). Those layers are then composed with global guidance $g_g$ and local guidance $g_l$.
  • Figure 4: GLoD for a single object. The images in the first column and 'Local' columns are sampled only from the global context (global images) and the local detail (local images) as an input prompt, respectively. The images in 'Composed' columns are sampled using our method, which effectively applies local detail (e.g., long-haired) to the object in the image while preserving the global contexts (i.e., object layouts and object postures).
  • Figure 5: GLoD for multiple objects. Our method (e) can control attributes of each sheep, while the other methods fail to reflect the specified attributes to the correct targets.
  • ...and 6 more figures
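
The layer composition described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `compose_noise`, the guidance weights `w_global` and `w_local`, the binary region `mask`, and the exact form of each guidance term are all assumptions, chosen to resemble classifier-free guidance. In particular, taking the local guidance as the masked difference between the local and global noise is a guess; the summary above does not give the equations.

```python
import numpy as np


def compose_noise(eps_uncond, eps_global, eps_local, mask,
                  w_global=7.5, w_local=7.5):
    """Compose three noise layers into one guided noise estimate.

    Hypothetical sketch: the layers loosely correspond to l_0 (unconditional),
    l_1 (global prompt) and l_2 (local prompt) in the Figure 3 caption.
    """
    # Assumed global guidance g_g: direction from unconditional to global noise.
    g_g = eps_global - eps_uncond
    # Assumed local guidance g_l: direction from global to local noise,
    # restricted to the object's region by a binary mask.
    g_l = mask * (eps_local - eps_global)
    return eps_uncond + w_global * g_g + w_local * g_l


# Toy usage with random arrays standing in for a model's noise predictions.
rng = np.random.default_rng(0)
shape = (4, 8, 8)                      # (channels, height, width) of a latent
eps_u, eps_g, eps_l = (rng.standard_normal(shape) for _ in range(3))
region = np.zeros((1, 8, 8))           # broadcasts over the channel axis
region[:, 2:6, 2:6] = 1.0              # hypothetical box for one object
eps = compose_noise(eps_u, eps_g, eps_l, region)
```

Under these assumptions, pixels outside the mask receive only the global guidance, which is one way to read the claim that unspecified identities are preserved while the masked object is steered toward the local prompt.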