The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

Yi Yao, Chan-Feng Hsu, Jhe-Hao Lin, Hongxia Xie, Terence Lin, Yi-Ning Huang, Hong-Han Shuai, Wen-Huang Cheng

TL;DR

The paper tackles the difficulty diffusion models face when following prompts that require scientific reasoning or creative imagination. It introduces RFBench, a Realistic-Fantasy Benchmark, and RFNet, a training-free pipeline that couples diffusion models with LLM-derived layouts and a semantic alignment module to produce coherent, detailed scenes. Through extensive GPT-based and human evaluations, RFNet demonstrates superior performance over state-of-the-art methods in both Realistic & Analytical and Creativity & Imagination prompts, with ablations confirming the value of LLM-driven detail synthesis, semantic alignment, and cross-attention-guided synthesis. The work enables accurate, imaginative scene generation without retraining, highlighting a practical pathway to improved prompt understanding and image fidelity in real-world applications, and provides publicly available code and RFBench data for reproducible research.

Abstract

In spite of recent advancements in text-to-image generation, limitations persist in handling complex and imaginative prompts due to the restricted diversity and complexity of training data. This work explores how diffusion models can generate images from prompts requiring artistic creativity or specialized knowledge. We introduce the Realistic-Fantasy Benchmark (RFBench), a novel evaluation framework blending realistic and fantastical scenarios. To address these challenges, we propose the Realistic-Fantasy Network (RFNet), a training-free approach integrating diffusion models with LLMs. Extensive human evaluations and GPT-based compositional assessments demonstrate our approach's superiority over state-of-the-art methods. Our code and dataset are available at https://leo81005.github.io/Reality-and-Fantasy/.

Paper Structure (23 sections, 5 equations, 13 figures, 5 tables)

This paper contains 23 sections, 5 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Text-to-image diffusion models such as Stable Diffusion (Rombach et al., 2022) often struggle to accurately follow prompts that involve scientific and empirical reasoning, metaphorical thinking, role conflicting, or imaginative scenarios. Our method achieves enhanced prompt understanding capabilities and accurately follows these types of prompts.
  • Figure 2: The collection pipeline of our proposed RFBench.
  • Figure 3: Overview of our proposed Realistic-Fantasy Network (RFNet). In stage 1, the user's input prompt is first processed by an LLM to extract the layout and descriptions. The descriptions then pass through a text encoder (the text-processing component of the CLIP model) and are refined by the SAA (semantic alignment) module to form a better prompt. In stage 2, the refined prompts are fed into the diffusion model for in-depth object generation, which creates each target object with precision. The resulting cross-attention map and mask latent are then used for seamless background integration, merging the objects into a single image.
  • Figure 4: Comprehensive Image Synthesis. In step 1, using the prompt refined by the SAA module, the frozen Stable Diffusion model generates each foreground object independently. During the denoising phase, the cross-attention map is extracted and saved for the Guidance Constraint in the next step. In step 2, a Suppression Constraint is also applied to minimize the influence between different objects.
  • Figure 5: Qualitative comparison on RFBench. The compared models include (a) Stable Diffusion, (b) MultiDiffusion, (c) Attend and Excite, (d) LMD, (e) BoxDiff, (f) SDXL, and (g) Ours. (Best viewed in color and zoomed in. More samples can be found in our supplementary material.)
  • ...and 8 more figures
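
The two-stage flow described in the Figure 3 and Figure 4 captions (an LLM produces a layout with per-object descriptions, each object is generated on its own, and the results are merged into one image) can be illustrated with a toy sketch. This is not the authors' implementation: it contains no diffusion model, no CLIP encoder, and no cross-attention guidance. Every name here (`ObjectSpec`, `interpret_prompt`, `box_mask`, `compose`, `fake_llm`) and the box convention are assumptions made for illustration, and the "LLM" is a hand-written stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical convention: (x0, y0, x1, y1) in grid cells, end-exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class ObjectSpec:
    """One foreground object as an LLM might describe it (illustrative only)."""
    name: str
    box: Box
    description: str


def interpret_prompt(prompt: str,
                     llm: Callable[[str], List[ObjectSpec]]) -> List[ObjectSpec]:
    """Stage 1 (sketch): ask an LLM for a layout plus per-object descriptions."""
    return llm(prompt)


def box_mask(box: Box, size: int) -> List[List[int]]:
    """Binary mask over a size x size grid: 1 inside the box, 0 elsewhere."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(size)]
            for y in range(size)]


def compose(specs: List[ObjectSpec], size: int) -> List[List[int]]:
    """Stage 2 (sketch): merge per-object regions onto one canvas.

    Each cell stores the 1-based index of the object that owns it
    (0 means background). Later objects overwrite earlier ones.
    """
    canvas = [[0] * size for _ in range(size)]
    for idx, spec in enumerate(specs, start=1):
        mask = box_mask(spec.box, size)
        for y in range(size):
            for x in range(size):
                if mask[y][x]:
                    canvas[y][x] = idx
    return canvas


def fake_llm(prompt: str) -> List[ObjectSpec]:
    """Hand-written stand-in for an LLM call; it ignores the prompt."""
    return [
        ObjectSpec("cat", (0, 0, 2, 2), "a fluffy cat"),
        ObjectSpec("moon", (2, 2, 4, 4), "a glowing full moon"),
    ]


specs = interpret_prompt("a cat beside the moon", fake_llm)
canvas = compose(specs, size=4)
for row in canvas:
    print(row)
```

In a real pipeline the integer canvas would be replaced by latent tensors, and the hard overwrite in `compose` would give way to the guidance and suppression constraints that the Figure 4 caption mentions; the sketch only shows how a layout can partition the scene among objects.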