CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance
Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Lladós, Xiatian Zhu, Anjan Dutta
TL;DR
This work tackles the challenge of generating images with explicit object counts in dense scenes using diffusion models. It introduces CountLoop, a training-free iterative framework in which a Design VLM builds a planning graph and grounds a layout with progressive image synthesis, and a Critic VLM assesses count fidelity and aesthetics to steer refinements. The method employs instance-aware cumulative attention to prevent semantic leakage and iteratively refines layouts and prompts until target counts are met, achieving state-of-the-art counting accuracy on COCO-Count and the high-density benchmarks CountLoop-S and CountLoop-M while preserving visual quality. The approach enables scalable count-controlled generation, with practical applications in data augmentation, game design, and pretraining tasks for video foundation models.
Abstract
Diffusion models have shown remarkable progress in photorealistic image synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, particularly in complex and high-density settings. We present CountLoop, a training-free framework that provides diffusion models with accurate instance control through iterative structured feedback. The approach alternates between image generation and multimodal agent evaluation, where a language-guided planner and critic assess object counts, spatial arrangements, and attribute consistency. This feedback is then used to refine layouts and guide subsequent generations. To further improve separation between objects, especially in occluded scenes, we introduce instance-driven attention masking and compositional generation techniques. Experiments on COCO-Count, T2I-CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, and reaches a score of 0.97, outperforming layout-based and gradient-guided baselines.
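The alternation between generation and agent evaluation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `count_loop`, the `Critique` record, the box format, and the three toy stand-ins (`toy_plan`, `toy_generate`, `toy_critique`) are all hypothetical names introduced here, and the "image" is simply a list of boxes so that the example runs without a diffusion model or a VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical layout format: normalized (x0, y0, x1, y1) boxes.
Box = Tuple[float, float, float, float]


@dataclass
class Critique:
    count: int      # instances the critic believes are present
    feedback: str   # hint passed back to the planner


def count_loop(
    target: int,
    plan: Callable[[int, str], List[Box]],
    generate: Callable[[List[Box]], object],
    critique: Callable[[object, int], Critique],
    max_rounds: int = 5,
):
    """Plan, generate, critique, and refine until the count matches.

    Returns (count_error, image, rounds_used) for the best attempt seen.
    """
    feedback = ""
    best = None
    for round_idx in range(max_rounds):
        layout = plan(target, feedback)      # planner proposes a layout
        image = generate(layout)             # generator renders it
        report = critique(image, target)     # critic counts instances
        error = abs(report.count - target)
        if best is None or error < best[0]:
            best = (error, image, round_idx + 1)
        if error == 0:                       # target count met: stop
            break
        feedback = report.feedback           # otherwise refine and retry
    return best


# --- Toy stand-ins (assumptions, for illustration only) ---

def toy_plan(target: int, feedback: str) -> List[Box]:
    # Request extra boxes when the critic reported a shortfall.
    extra = int(feedback) if feedback else 0
    n = target + extra
    return [(i / n, 0.0, (i + 1) / n, 1.0) for i in range(n)]


def toy_generate(layout: List[Box]) -> List[Box]:
    # Simulates a generator that loses one instance per ten requested.
    keep = len(layout) - len(layout) // 10
    return layout[:keep]


def toy_critique(image: List[Box], target: int) -> Critique:
    count = len(image)
    return Critique(count=count, feedback=str(target - count))
```

With these stand-ins, `count_loop(20, toy_plan, toy_generate, toy_critique)` first produces 18 instances, the critic reports a shortfall of 2, and the second round requests 22 boxes and lands on exactly 20. In a real system the planner and critic would be VLM calls and the generator a layout-conditioned diffusion model; the sketch only shows the control flow of the feedback loop.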
