The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation
Yi Yao, Chan-Feng Hsu, Jhe-Hao Lin, Hongxia Xie, Terence Lin, Yi-Ning Huang, Hong-Han Shuai, Wen-Huang Cheng
TL;DR
The paper tackles the difficulty diffusion models face when following prompts that require scientific reasoning or creative imagination. It introduces RFBench, a Realistic-Fantasy Benchmark, and RFNet, a training-free pipeline that couples diffusion models with LLM-derived layouts and a semantic alignment module to produce coherent, detailed scenes. In extensive GPT-based and human evaluations, RFNet outperforms state-of-the-art methods on both Realistic & Analytical and Creativity & Imagination prompts, and ablations confirm the value of LLM-driven detail synthesis, semantic alignment, and cross-attention-guided synthesis. The work enables accurate, imaginative scene generation without retraining, offering a practical route to better prompt understanding and image fidelity, and it releases code and the RFBench data publicly for reproducible research.
Abstract
Despite recent advances in text-to-image generation, limitations persist in handling complex and imaginative prompts due to the restricted diversity and complexity of training data. This work explores how diffusion models can generate images from prompts requiring artistic creativity or specialized knowledge. We introduce the Realistic-Fantasy Benchmark (RFBench), a novel evaluation framework blending realistic and fantastical scenarios. To address these challenges, we propose the Realistic-Fantasy Network (RFNet), a training-free approach integrating diffusion models with LLMs. Extensive human evaluations and GPT-based compositional assessments demonstrate our approach's superiority over state-of-the-art methods. Our code and dataset are available at https://leo81005.github.io/Reality-and-Fantasy/.
