CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation

Joohyeon Lee, Jin-Seop Lee, Jee-Hyong Lee

TL;DR

CountCluster presents a training-free approach to enforce the number of objects in diffusion-based text-to-image generation by shaping early cross-attention maps. It constructs a CAM-aware target object map from high-activation regions and uses a KL-based loss to align the object CAM with k well-separated clusters at the first denoising timestep. The method requires no external tools or training and demonstrates substantial improvements in object-count accuracy while maintaining image quality across diverse prompts and backbones. This yields a practical, efficient solution for faithful quantity control in text-to-image synthesis with broad applicability to real-world prompts.

Abstract

Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. These approaches remain limited in how accurately they reflect the specified number of objects, and they overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity in number, and each region should be clearly separated from the others. To address this issue, we propose CountCluster, a method that guides the object cross-attention map to be clustered according to the object count specified in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into $k$ clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5 percentage points in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. Code will be released at: https://github.com/JoohyeonL22/CountCluster
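
The three steps named in the abstract (cluster the high-activation region of the object cross-attention map into $k$ groups, build a well-separated target distribution, measure the mismatch with a KL term) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `top_frac` and `sigma` parameters, the use of plain k-means, the Gaussian-blob target, and the KL direction are all assumptions chosen for illustration. The latent-optimization step is omitted because it requires a diffusion backbone.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on (n, 2) points; returns (k, 2) float centers."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(points), size=k, replace=False)
    centers = points[init].astype(float)
    for _ in range(iters):
        # Distance from every point to every center, shape (n, k).
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def target_map(cam, k, top_frac=0.2, sigma=2.0):
    """Build a hypothetical k-blob target distribution from a 2D attention map.

    top_frac and sigma are illustrative assumptions, not values from the paper.
    """
    h, w = cam.shape
    n_top = max(k, int(top_frac * cam.size))
    # Coordinates (row, col) of the most strongly activated pixels.
    flat_idx = np.argsort(cam.ravel())[-n_top:]
    pts = np.stack(np.unravel_index(flat_idx, cam.shape), axis=1)
    centers = kmeans(pts, k)
    # One isotropic Gaussian per cluster center, summed and normalized.
    ys, xs = np.mgrid[0:h, 0:w]
    target = np.zeros((h, w))
    for cy, cx in centers:
        target += np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma**2))
    return target / target.sum(), centers


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two maps, each normalized to sum to 1."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

In a full pipeline, a loss such as `kl_divergence(target, cam)` would be differentiated with respect to the latent at an early denoising step, which means re-expressing these operations in an autodiff framework; the NumPy version here only shows how the target and the mismatch score could be formed.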

Paper Structure

This paper contains 43 sections, 11 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Generated images using SDXL (Podell et al., 2024) for the prompt "A photo of four lemons". Each example, generated under the same conditions with different seeds, shows the cross-attention maps of "lemons" at timesteps $t=50, 45, 40$ together with the final image at $t=0$. (a) and (b) are failure cases with incorrect instance counts, while (c) is a successful case with the correct number of instances. The positions and number of instances are mostly determined at the early timesteps of the denoising process.
  • Figure 2: Comparison of three latent optimization settings for controlling object quantity.
  • Figure 3: The first two rows illustrate cases where only the object quantity in the prompt is changed while keeping the random seed. The last two rows show results where only the object category (e.g., “donuts” → “oranges”) is changed under the same random seed.
  • Figure 4: Accuracy comparison across different object counts, showing model performance as the number of objects specified in the prompt increases from 2 to 10.
  • Figure 5: Qualitative results on complex object configurations, including coiled, elongated, and occluded objects, as well as objects in natural backgrounds.
  • ...and 3 more figures