Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation
Tianyidan Xie, Rui Ma, Qian Wang, Xiaoqian Ye, Feixuan Liu, Ying Tai, Zhenyu Zhang, Lanjun Wang, Zili Yi
TL;DR
Foreground-conditioned image generation remains challenging because end-to-end inpainting models suffer from compromised object integrity, foreground-background inconsistencies, limited diversity, and textual misalignment. Anywhere introduces a modular multi-agent framework that decouples foreground understanding, prompt optimization, template generation, and quality control, leveraging LLMs and VLMs for guidance and feedback. Reported improvements include a 4.6% reduction in FID, a 24% increase in user preference, a 33% rise in diversity, a 5% gain in text-image matching, and a 44% drop in bad cases, at the cost of 2–3× higher GPU time. The framework is extensible and can benefit from future advances in its individual agents, enabling more reliable and customizable foreground-conditioned image generation.
Abstract
Recent advancements in image-conditioned image generation have demonstrated substantial progress. However, foreground-conditioned image generation remains underexplored, encountering challenges such as compromised object integrity, foreground-background inconsistencies, limited diversity, and reduced control flexibility. These challenges arise from current end-to-end inpainting models, which suffer from inaccurate training masks, limited foreground semantic understanding, data distribution biases, and inherent interference between visual and textual prompts. To overcome these limitations, we present Anywhere, a multi-agent framework that departs from the traditional end-to-end approach. In this framework, each agent is specialized in a distinct aspect, such as foreground understanding, diversity enhancement, object integrity protection, and textual prompt consistency. Our framework is further enhanced with the ability to incorporate optional user textual inputs, perform automated quality assessments, and initiate re-generation as needed. Comprehensive experiments demonstrate that this modular design effectively overcomes the limitations of existing end-to-end models, resulting in higher fidelity, quality, diversity and controllability in foreground-conditioned image generation. Additionally, the Anywhere framework is extensible, allowing it to benefit from future advancements in each individual agent.
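To make the control flow described above concrete, the following is a minimal sketch of a generate, assess, and re-generate loop with an optional user text input. It is an illustration under stated assumptions, not the authors' implementation: every name here (generate_with_quality_control, the demo_* stubs, max_attempts) is hypothetical, and plain strings stand in for images and model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    prompt: str   # text prompt used for this candidate
    image: str    # placeholder for an image (a string in this sketch)
    attempt: int  # 1-based attempt on which the candidate was accepted


def generate_with_quality_control(
    foreground: str,
    describe: Callable[[str], str],
    build_prompt: Callable[[str, Optional[str]], str],
    generate: Callable[[str, str, int], str],
    passes_check: Callable[[str, str], bool],
    user_text: Optional[str] = None,
    max_attempts: int = 3,
) -> Optional[Candidate]:
    """Hypothetical orchestration: each callable plays the role of one agent."""
    description = describe(foreground)               # foreground understanding
    prompt = build_prompt(description, user_text)    # prompt construction
    for attempt in range(1, max_attempts + 1):
        image = generate(foreground, prompt, attempt)  # attempt doubles as a seed
        if passes_check(foreground, image):            # automated quality assessment
            return Candidate(prompt, image, attempt)
    return None  # every attempt was rejected


# Stub agents for demonstration only; real agents would call LLMs, VLMs,
# and an image generator.
def demo_describe(foreground: str) -> str:
    return f"a photo of {foreground}"


def demo_build_prompt(description: str, user_text: Optional[str]) -> str:
    return description if user_text is None else f"{description}, {user_text}"


def demo_generate(foreground: str, prompt: str, seed: int) -> str:
    return f"{prompt} [seed={seed}]"


def demo_passes_check(foreground: str, image: str) -> bool:
    # Pretend that only the second attempt passes the check.
    return image.endswith("[seed=2]")


result = generate_with_quality_control(
    "red sneaker",
    demo_describe,
    demo_build_prompt,
    demo_generate,
    demo_passes_check,
    user_text="on a beach",
)
```

Passing the agents in as callables mirrors the modularity claim in the abstract: any single component can be replaced with a stronger one without changing the loop itself.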
