CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Lladós, Xiatian Zhu, Anjan Dutta

TL;DR

This work tackles the challenge of generating images with explicit object counts in dense scenes using diffusion models. It introduces CountLoop, a training-free iterative framework that uses a Design VLM to build a planning graph and ground a layout with progressive image synthesis, and a Critic VLM to assess count fidelity and aesthetics and steer refinements. The method employs instance-aware cumulative attention to prevent semantic leakage and iteratively refines layouts and prompts until target counts are met, achieving state-of-the-art counting on COCO-Count and high-density benchmarks CountLoop-S and CountLoop-M while preserving visual quality. The approach enables scalable count-controlled generation with practical applications in data augmentation, game design, and pretraining tasks for video foundation models.

Abstract

Diffusion models have shown remarkable progress in photorealistic image synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, particularly in complex and high-density settings. We present CountLoop, a training-free framework that provides diffusion models with accurate instance control through iterative structured feedback. The approach alternates between image generation and multimodal agent evaluation, where a language-guided planner and critic assess object counts, spatial arrangements, and attribute consistency. This feedback is then used to refine layouts and guide subsequent generations. To further improve separation between objects, especially in occluded scenes, we introduce instance-driven attention masking and compositional generation techniques. Experiments on COCO-Count, T2I-CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, and that it outperforms layout-based and gradient-guided baselines with a score of 0.97.
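
The alternation between generation and critic feedback described above can be illustrated with a minimal Python sketch. Every name in it (`refine_until_counted`, `toy_generate`, `toy_critic`, and the additive count correction) is a hypothetical stand-in chosen for illustration, not the paper's implementation: the actual system uses a diffusion backbone and VLM agents, and refines layouts and prompts rather than only the requested counts.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Counts = Dict[str, int]


@dataclass
class LoopResult:
    plan: Counts      # per-class counts last requested from the generator
    detected: Counts  # per-class counts the critic reported for the last image
    iterations: int   # number of generate/critique rounds that were run
    converged: bool   # True if every class matched its target count


def refine_until_counted(
    target: Counts,
    generate: Callable[[Counts], Any],
    critic: Callable[[Any], Counts],
    max_iters: int = 10,
) -> LoopResult:
    """Alternate generation and critique until detected counts match the target."""
    plan = dict(target)
    detected: Counts = {}
    for step in range(1, max_iters + 1):
        image = generate(plan)
        detected = critic(image)
        # Positive error: too few instances detected; negative: too many.
        errors = {c: target[c] - detected.get(c, 0) for c in target}
        if all(e == 0 for e in errors.values()):
            return LoopResult(plan, detected, step, True)
        # Assumed feedback rule: shift each requested count by its error.
        plan = {c: max(0, plan[c] + errors[c]) for c in target}
    return LoopResult(plan, detected, max_iters, False)


def toy_generate(plan: Counts) -> Counts:
    """Toy 'generator' that under-produces: it drops one instance per ten requested."""
    return {c: n - n // 10 for c, n in plan.items()}


def toy_critic(image: Counts) -> Counts:
    """Toy 'critic' that counts perfectly; the toy image already is a count table."""
    return dict(image)


result = refine_until_counted({"apple": 20, "bird": 7}, toy_generate, toy_critic)
print(result)
```

In this toy setting the first round yields 18 apples instead of 20, the plan is raised to 22, and the second round meets both targets. A real critic would be noisy, so a practical loop would likely also need a stopping budget, which `max_iters` stands in for here.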

Paper Structure

This paper contains 20 sections, 12 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: Given prompts with explicit per-class counts, CountLoop (top-left) produces high-instance images whose detected counts align with the targets. Under identical prompts, recent text- and layout-based image generation baselines often under- or over-generate at high cardinalities. We further illustrate practical uses of count-specific image generation (right): (a) in object counting [ranjan2021learning], for augmenting datasets; (b) in AI-driven games [microsoft2025muse], where accurate object counts (e.g., buildings, cards) are crucial for gameplay design; and (c) in video foundation model pre-training [wan2025wan, hong2022cogvideo], where synthetic count images can enhance diversity and generalization compared to scarce real-world counting datasets.
  • Figure 2: Issues in high-instance image generation.
  • Figure 3: Given a text prompt, ⓐ The Design VLM parses the prompt to construct a planning graph, which is converted into a pixel-aligned layout ⓑ. ⓒ This layout guides an IP-Adapter-enhanced T2I backbone for image generation. ⓓ A Critic VLM evaluates the generated image's count and aesthetics, providing structured feedback to update the planning graph. ⓔ This iterative loop continues until objectives are met.
  • Figure 4: Cumulative latent composition, along with disentangled query feature extraction, mitigates attribute leakage
  • Figure 5: Successive layout refinement using the VLM critic. The corresponding layouts are shown in the insets.
  • ...and 11 more figures
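
As a rough illustration of step ⓑ in Figure 3, where per-class counts from the planning graph become a pixel-aligned layout, the sketch below places one box per requested instance on a uniform grid. This is only an assumed placeholder: the function names `grid_layout` and `boxes_overlap`, the 1024×1024 canvas, and the grid placement itself are hypothetical, and the paper's Design VLM produces its layout by other means.

```python
import math
from typing import Dict, List, Tuple

# (label, x0, y0, x1, y1) in pixel coordinates, x1 and y1 exclusive
Box = Tuple[str, int, int, int, int]


def grid_layout(counts: Dict[str, int], width: int = 1024, height: int = 1024) -> List[Box]:
    """Assign one non-overlapping grid cell to every requested instance."""
    labels = [label for label, n in counts.items() for _ in range(n)]
    total = len(labels)
    if total == 0:
        return []
    cols = math.ceil(math.sqrt(total))  # near-square grid
    rows = math.ceil(total / cols)
    cell_w, cell_h = width // cols, height // rows
    boxes: List[Box] = []
    for i, label in enumerate(labels):
        r, c = divmod(i, cols)  # row-major placement
        boxes.append((label, c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h))
    return boxes


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two boxes share interior area (touching edges do not count)."""
    return a[1] < b[3] and b[1] < a[3] and a[2] < b[4] and b[2] < a[4]


layout = grid_layout({"cat": 5, "dog": 4})
for box in layout:
    print(box)
```

Because each instance owns a separate cell, the number of boxes per class equals the requested count by construction, which is the property a count-aware layout needs before any image is synthesized.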