Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation

Tianyidan Xie, Rui Ma, Qian Wang, Xiaoqian Ye, Feixuan Liu, Ying Tai, Zhenyu Zhang, Lanjun Wang, Zili Yi

TL;DR

Foreground-conditioned image generation remains challenging because end-to-end inpainting models suffer from compromised object integrity, foreground-background inconsistencies, limited diversity, and textual misalignment. Anywhere introduces a modular multi-agent framework that decouples foreground understanding, prompt optimization, template generation, and quality control, using LLMs and VLMs for guidance and feedback. Reported improvements include a 4.6% reduction in FID, a 24% increase in user preference, a 33% rise in diversity, a 5% gain in text-image matching, and a 44% drop in bad cases, at the cost of 2–3× higher GPU time. Because the framework is modular, it can benefit from future advances in its individual agents, enabling more reliable and customizable foreground-conditioned image generation.

Abstract

Recent advancements in image-conditioned image generation have demonstrated substantial progress. However, foreground-conditioned image generation remains underexplored, encountering challenges such as compromised object integrity, foreground-background inconsistencies, limited diversity, and reduced control flexibility. These challenges arise from current end-to-end inpainting models, which suffer from inaccurate training masks, limited foreground semantic understanding, data distribution biases, and inherent interference between visual and textual prompts. To overcome these limitations, we present Anywhere, a multi-agent framework that departs from the traditional end-to-end approach. In this framework, each agent is specialized in a distinct aspect, such as foreground understanding, diversity enhancement, object integrity protection, and textual prompt consistency. Our framework is further enhanced with the ability to incorporate optional user textual inputs, perform automated quality assessments, and initiate re-generation as needed. Comprehensive experiments demonstrate that this modular design effectively overcomes the limitations of existing end-to-end models, resulting in higher fidelity, quality, diversity and controllability in foreground-conditioned image generation. Additionally, the Anywhere framework is extensible, allowing it to benefit from future advancements in each individual agent.
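
The abstract's "automated quality assessments" and "re-generation as needed" can be pictured as a bounded feedback loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `generate`, `evaluate`, the `Evaluation` record, and the way feedback is appended to the prompt are all hypothetical placeholders. The default of three rounds follows the Figure A1 caption, which reports three iterations as a good balance between quality and cost.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class Evaluation:
    """Hypothetical verdict returned by a quality evaluator."""
    accepted: bool
    feedback: str


def generate_with_feedback(
    generate: Callable[[str], Any],
    evaluate: Callable[[Any], Evaluation],
    prompt: str,
    max_rounds: int = 3,
) -> Tuple[Any, int]:
    """Generate an image, then re-generate while the evaluator rejects it.

    Returns the last image and the number of re-generation rounds used.
    After the final allowed round the image is returned without another
    evaluation, so the loop always terminates.
    """
    if max_rounds < 0:
        raise ValueError("max_rounds must be non-negative")

    image = generate(prompt)
    rounds_used = 0
    while rounds_used < max_rounds:
        verdict = evaluate(image)
        if verdict.accepted:
            break
        # Assumption: feedback is folded back into the text prompt.
        # The source only says feedback triggers re-generation.
        prompt = f"{prompt} (revise: {verdict.feedback})"
        image = generate(prompt)
        rounds_used += 1
    return image, rounds_used
```

In practice `generate` would wrap the whole image generation stage and `evaluate` would wrap a VLM call; here both are left as injected callables so the control flow can be exercised with stubs.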

Paper Structure

This paper contains 41 sections, 13 figures, 4 tables, and 3 algorithms.

Figures (13)

  • Figure 1: Comparison of our Anywhere framework with inpainting models for foreground-conditioned image generation. The left section highlights the limitations of existing inpainting models, while the right section showcases our results. Our approach effectively addresses the issues (e.g., violated object integrity, foreground-background inconsistency, limited diversity, and compromised textual consistency), producing foreground-preserved, semantically coherent, diverse and text-consistent backgrounds tailored to the given foreground objects.
  • Figure 2: Overview of the Anywhere framework. (a) Our approach comprises three main components: the Prompt Generation Module, the Image Generation Module, and the Quality Evaluator (Agent). The Prompt Generation Module uses a Foreground Analyzer (VLM) to extract textual descriptions from the foreground and a Prompt Creator (LLM) to generate multiple textual prompts based on the foreground descriptions and, if provided, the user's textual inputs. The Prompt Selector (LLM) then assesses these prompts and selects the best-matched one. The Image Generation Module includes a Template Generator (edge-guided image generation model) that generates a template image from the textual prompt, a Template Repainter that detects object integrity violations (highlighted in green) and resolves them if needed, and an Image Enhancer (high-resolution image refinement model) that pastes the foreground back and harmonizes the final output. The Quality Evaluator Agent (VLM) assesses the resulting image, providing descriptive feedback and triggering re-generation when needed. (b) Illustration of the Template Repainter, which detects violations by foreground segmentation and mask contrasting and inpaints the violated regions if any exist. (c) Illustration of the template repainting tools used in the framework.
  • Figure 3: We compare our approach to advanced inpainting models on foreground-conditioned image generation tasks in both text-free (I2I) and text-guided (TI2I) scenarios. These results are generated using unconstrained, in-the-wild foreground images. Red color indicates missing elements in generated images. The inpainting models used for comparison include HD-Painter (HDP), BrushNet (BN), and Stable Diffusion 2.0 Inpainting (SDI).
  • Figure 4: Ablation studies on the Template Repainter.
  • Figure A1: Impact of Quality Evaluator iterations on performance metrics. This graph illustrates the normalized change in various evaluation metrics across multiple rounds of Quality Evaluator feedback. Starting from the baseline (0 rounds), we show how these metrics evolve through 5 iterations, demonstrating the trade-off between quality improvement and computational cost. The results indicate that three iterations provide an optimal balance between performance gains and efficiency.
  • ...and 8 more figures
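
The Figure 2 caption describes the Template Repainter as detecting object integrity violations by "foreground segmentation and mask contrasting". One plausible reading, sketched below, is to compare the original foreground mask with the mask obtained by re-segmenting the foreground object in the generated template, and to flag pixels where the object has grown beyond its original outline. This is an assumption about the mechanism rather than the authors' algorithm; the function names, the list-of-lists mask representation, and the `tolerance` parameter are all hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def violation_mask(foreground: Mask, resegmented: Mask) -> Mask:
    """Mark pixels covered by the re-segmented object but not the original."""
    if len(foreground) != len(resegmented) or any(
        len(f_row) != len(s_row) for f_row, s_row in zip(foreground, resegmented)
    ):
        raise ValueError("masks must have the same shape")
    return [
        [1 if seg and not fg else 0 for fg, seg in zip(f_row, s_row)]
        for f_row, s_row in zip(foreground, resegmented)
    ]


def needs_repaint(foreground: Mask, resegmented: Mask, tolerance: float = 0.0) -> bool:
    """Return True when the extra area exceeds `tolerance` of the object area.

    `tolerance` is a hypothetical knob for absorbing segmentation noise.
    """
    object_area = sum(1 for row in foreground for value in row if value)
    if object_area == 0:
        raise ValueError("foreground mask is empty")
    extra_area = sum(map(sum, violation_mask(foreground, resegmented)))
    return extra_area / object_area > tolerance
```

Under this reading, the non-zero region of `violation_mask` would be the area handed to an inpainting tool, while a template that passes `needs_repaint` unchanged would go straight to the Image Enhancer.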