World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge

Moo Hyun Son, Jintaek Oh, Sun Bin Mun, Jaechul Roh, Sehyun Choi

TL;DR

Diffusion-based text-to-image models struggle with prompts that introduce novel concepts outside their training data. World-To-Image (W2I) introduces an agent-driven framework that dynamically retrieves web-based textual definitions and reference images, performs semantic decomposition and concept substitution, and grounds outputs with retrieved exemplars to improve semantic fidelity without altering the base model. The approach integrates an Orchestrator with a Prompt Optimizer Agent and an Image Retriever Agent, achieving substantial gains (notably +8.1% in accuracy-to-prompt on the NICE benchmark) while maintaining competitive aesthetics, and demonstrates efficiency by completing optimization in two iterations. This work highlights the practical value of interface-level improvements, showing that retrieval-grounded, multimodal prompting can unlock latent knowledge in pretrained models, enabling better alignment with an evolving real world without retraining or scaling.

Abstract

While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving a +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency, in fewer than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available at https://github.com/mhson-kyle/World-To-Image.
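The control flow described above (an Orchestrator that either invokes the Image Retriever Agent or the Prompt Optimizer Agent, re-generates, and keeps the best-scoring result within a small iteration budget) can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `orchestrate`, `generate`, `score`, `optimize_prompt`, `retrieve_images`, and `invoke_ira` are all hypothetical, the agents are passed in as plain callables, and the default budget of two iterations simply mirrors the figure quoted in the TL;DR.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Candidate:
    """A prompt plus optional reference images, with its evaluation score."""
    prompt: str
    references: List[str] = field(default_factory=list)
    score: float = float("-inf")


def orchestrate(
    user_prompt: str,
    generate: Callable[[str, List[str]], str],    # hypothetical frozen T2I backbone
    score: Callable[[str, str], float],           # hypothetical prompt-image scorer
    optimize_prompt: Callable[[str], str],        # hypothetical Prompt Optimizer Agent
    retrieve_images: Callable[[str], List[str]],  # hypothetical Image Retriever Agent
    invoke_ira: Callable[[Candidate], bool],      # hypothetical Orchestrator decision
    max_iters: int = 2,                           # assumed budget
) -> Candidate:
    # Baseline: generate from the unmodified prompt with no reference images.
    best = Candidate(user_prompt)
    best.score = score(user_prompt, generate(best.prompt, best.references))

    for _ in range(max_iters):
        if invoke_ira(best):
            # Ground the generation with retrieved exemplars.
            cand = Candidate(best.prompt, retrieve_images(user_prompt))
        else:
            # Rewrite the text prompt, keeping any references found so far.
            cand = Candidate(optimize_prompt(best.prompt), list(best.references))

        # Always score against the original user prompt, not the rewritten one.
        cand.score = score(user_prompt, generate(cand.prompt, cand.references))
        if cand.score > best.score:
            best = cand

    return best
```

The base model appears only behind `generate`, which reflects the claim that the gains come from the interface (prompt and references) rather than from retraining.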

Paper Structure

This paper contains 26 sections, 3 equations, 13 figures, 10 tables, 1 algorithm.

Figures (13)

  • Figure 1: Overview of World-To-Image.
  • Figure 2: Illustration of a case where the Orchestrator Agent invokes the Image Retriever Agent (invoke-IRA=1).
  • Figure 3: Qualitative comparison of text-to-image generation results across seven models. Our model consistently demonstrates stronger semantic alignment (e.g., “Doomer Doge staring at TikTok stock crash”), accurate identity grounding (e.g., “Kai Cenat streaming from spaceship”), and faithful concept representation (e.g., “mommy AI”), outperforming baselines in both fidelity and prompt adherence.
  • Figure 4: Qualitative comparison of image generations across models for diverse prompts. Each row corresponds to one prompt, with columns showing outputs from left to right: Ours, OmniGen2, Promptist w OmniGen2, Promptist w SDXL-Base, SDXL-Base, SD2.1, and SD1.4.
  • Figure 5: LLMGrader overall scores across subcategories. Our method consistently outperforms all baselines.
  • ...and 8 more figures