ReGround: Improving Textual and Spatial Grounding at No Cost

Phillip Y. Lee, Minhyuk Sung

TL;DR

This paper tackles the trade-off between textual and spatial grounding in diffusion-based text-to-image generation when spatial cues (bounding boxes) are provided. It shows that GLIGEN’s gated self-attention biases generation toward spatial grounding because its output flows sequentially into cross-attention, which causes the generated image to omit descriptions given in the prompt. The authors propose ReGround, a simple inference-time rewiring that switches the two attention modules from sequential to parallel, enabling simultaneous textual and spatial grounding without fine-tuning or extra parameters. On MS-COCO and NSR-1K-GPT, ReGround delivers notably better textual grounding (CLIP) with minimal loss in spatial grounding (YOLO) and even improves image quality (FID); the approach also applies to other GLIGEN-based methods such as BoxDiff and Attn-Refocus.

Abstract

When an image generation process is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do these elements work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention. We demonstrate that such bias can be significantly mitigated without sacrificing accuracy in either grounding by simply rewiring the network architecture, changing from sequential to parallel for gated self-attention and cross-attention. This surprisingly simple yet effective solution does not require any fine-tuning of the network but significantly reduces the trade-off between the two groundings. Our experiments demonstrate significant improvements from the original GLIGEN to the rewired version in the trade-off between textual grounding and spatial grounding.
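The rewiring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `sequential_block`, `parallel_block`, `toy_gated_self_attn`, and `toy_cross_attn` are hypothetical, features are plain lists of floats, and each module is assumed to return a residual update that is added to its input. The only point is the data flow: in the sequential wiring, cross-attention reads features already modified by gated self-attention, while in the parallel wiring both modules read the same input and their updates are summed.

```python
from typing import Callable, List

Vec = List[float]
Module = Callable[[Vec], Vec]  # hypothetical: maps features to a residual update


def add(a: Vec, b: Vec) -> Vec:
    """Element-wise sum of two equally sized feature vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_block(h: Vec, gated_self_attn: Module, cross_attn: Module) -> Vec:
    """GLIGEN-style flow (assumed form): gated self-attention, then cross-attention."""
    h = add(h, gated_self_attn(h))  # spatial update is applied first
    h = add(h, cross_attn(h))       # text update sees the spatially modified features
    return h


def parallel_block(h: Vec, gated_self_attn: Module, cross_attn: Module) -> Vec:
    """ReGround-style flow (assumed form): both modules read the same input."""
    spatial = gated_self_attn(h)    # computed from the shared input
    textual = cross_attn(h)         # computed from the same shared input
    return add(add(h, spatial), textual)


# Toy stand-ins for the real attention modules (assumptions for illustration only).
def toy_gated_self_attn(h: Vec) -> Vec:
    return [1.0 for _ in h]         # constant "spatial" shift


def toy_cross_attn(h: Vec) -> Vec:
    return [2.0 * x for x in h]     # "textual" update that depends on its input


h0 = [1.0, 2.0]
seq_out = sequential_block(h0, toy_gated_self_attn, toy_cross_attn)
par_out = parallel_block(h0, toy_gated_self_attn, toy_cross_attn)
print(seq_out, par_out)
```

In this toy setting the two wirings give different outputs because the sequential cross-attention update is computed from features that already contain the spatial shift. Both wirings reuse the same modules, which is consistent with the claim that the rewiring needs no fine-tuning or extra parameters.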

Paper Structure (36 sections, 3 equations, 25 figures, 1 algorithm)

Figures (25)

  • Figure 1: Comparison across Stable Diffusion (SD) [rombach2022high], GLIGEN [li2023gligen], and our ReGround. SD (2nd column) can generate an image aligned with the input prompt (shown below each row), but it cannot take spatial constraints such as bounding boxes and labels. GLIGEN (3rd column) enables spatial grounding using gated self-attention, although it often disregards some descriptions in the input prompt due to a bias towards bounding box conditions. The same trend occurs when gated self-attention is activated only for the first 0.2 fraction of the denoising steps (4th column). Our ReGround (last column) resolves the issue of description omission while accurately reflecting the bounding box information.
  • Figure 2: Comparison between the U-Net architectures of (a) Latent Diffusion Model (LDM) [rombach2022high], (b) GLIGEN [li2023gligen], and (c) our ReGround. Starting from LDM, GLIGEN enables spatial grounding by injecting gated self-attention before cross-attention, so that the two modules form a sequential flow. Based on GLIGEN, our ReGround rewires the two attention modules to run in parallel, resulting in a noticeable improvement in textual grounding while preserving the spatial grounding capability. (The residual block before self-attention is omitted.)
  • Figure 3: (a) Images generated by GLIGEN [li2023gligen] with varying activation duration $\gamma$ of gated self-attention in scheduled sampling (Sec. \ref{subsec:grounding_trade-off}). The red words in the text prompt denote the words used as labels of the input bounding boxes. Note that for GLIGEN to reflect the underlined description in the text prompt in the final image, $\gamma$ must be decreased to 0.1, which compromises spatial grounding accuracy. (b) In contrast, our ReGround reflects the underlined phrase even when $\gamma=1.0$, thereby achieving high accuracy in both textual and spatial grounding.
  • Figure 4: Comparison of the output of GLIGEN [li2023gligen] with and without cross-attention. While the absence of cross-attention reduces the realism and quality of the image, the silhouettes of objects remain grounded within the given bounding boxes, as shown in the third column of each case.
  • Figure 5: Qualitative comparisons. Stable Diffusion (SD, 2nd column) generates images that align with the given text descriptions, including the underlined phrase in each row, but cannot take bounding boxes as input. GLIGEN (3rd column) creates images that match the input layouts but suffers from description omission, failing to reflect the underlined descriptions. Scheduled sampling strategy (4th column) can partially address this issue (for instance, in the 5th row, where "window" appears in the room), but it results in a noticeable decline in spatial accuracy (as seen in the 1st row, where the tie is not generated). In contrast, our method (last column) accurately incorporates the underlined text descriptions while maintaining precise spatial representation.
  • ...and 20 more figures
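
The scheduled sampling mentioned in the captions of Figures 1, 3, and 5 can be sketched as a per-step on/off gate. This is a hedged illustration only: the function name `gate_schedule` and the rounding rule are assumptions, not taken from the paper. It assumes that gated self-attention is active for the first $\gamma$ fraction of the denoising steps and disabled afterwards.

```python
from typing import List


def gate_schedule(num_steps: int, gamma: float) -> List[bool]:
    """Return, per denoising step, whether gated self-attention is active.

    Assumption: the module is on for the first round(gamma * num_steps)
    steps and off for the rest; the exact rounding is illustrative.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    num_active = int(round(gamma * num_steps))
    return [step < num_active for step in range(num_steps)]


# gamma = 0.2: spatial grounding only in the first 20% of the steps.
partial = gate_schedule(10, 0.2)
# gamma = 1.0: gated self-attention stays active for every step.
full = gate_schedule(10, 1.0)
print(partial)
print(full)
```

Under this reading, lowering $\gamma$ shortens the window in which the bounding boxes influence generation, which matches the captions' observation that small $\gamma$ recovers text descriptions in GLIGEN at the cost of spatial accuracy.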