GPTDrawer: Enhancing Visual Synthesis through ChatGPT
Kun Li, Xinwei Chen, Tianyou Song, Hansong Zhang, Wenzhe Zhang, Qing Shan
TL;DR
The paper addresses the misalignment between complex textual prompts and visual outputs in diffusion-based image synthesis. It introduces GPTDrawer, a pipeline that uses ChatGPT for keyword extraction and prompt refinement, then iteratively regenerates images with Stable Diffusion until the cosine similarity between image and text representations, $Sim_{cos}$, reaches a threshold $T$. BLIP-based evaluation informs each refinement. Compared with a baseline Stable Diffusion pipeline on two scenes, the approach shows better keyword coverage and semantic fidelity. The work presents a practical NLP-augmented framework for more faithful AI-generated visuals, with potential applications in creative arts and design automation.
Abstract
In the burgeoning field of AI-driven image generation, the quest for precision and relevance in response to textual prompts remains paramount. This paper introduces GPTDrawer, an innovative pipeline that leverages the generative prowess of GPT-based models to enhance the visual synthesis process. Our methodology employs a novel algorithm that iteratively refines input prompts using keyword extraction, semantic analysis, and image-text congruence evaluation. By integrating ChatGPT for natural language processing and Stable Diffusion for image generation, GPTDrawer produces a batch of images that undergo successive refinement cycles, guided by cosine similarity metrics until a threshold of semantic alignment is attained. The results demonstrate a marked improvement in the fidelity of images generated in accordance with user-defined prompts, showcasing the system's ability to interpret and visualize complex semantic constructs. The implications of this work extend to various applications, from creative arts to design automation, setting a new benchmark for AI-assisted creative processes.
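The refinement cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the callables `generate`, `embed_image`, `embed_text`, and `refine` are hypothetical placeholders standing in for Stable Diffusion, a BLIP-based image representation, a text encoder, and ChatGPT prompt refinement, respectively. The `max_rounds` cap is also an assumption added so that the loop always terminates.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def refine_until_aligned(
    prompt: str,
    generate: Callable[[str], List[Any]],    # hypothetical: prompt -> batch of images
    embed_image: Callable[[Any], Vector],    # hypothetical: image -> representation
    embed_text: Callable[[str], Vector],     # hypothetical: text -> representation
    refine: Callable[[str, Any], str],       # hypothetical: (prompt, best image) -> new prompt
    threshold: float,                        # plays the role of T
    max_rounds: int = 5,                     # assumption: cap on refinement cycles
) -> Tuple[Any, float]:
    """Regenerate images until the best similarity reaches the threshold."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    # The user's original prompt stays the reference for every comparison.
    target = embed_text(prompt)
    current_prompt = prompt
    best_image, best_score = None, -1.0

    for round_index in range(max_rounds):
        if round_index > 0:
            # Ask the refiner for a new prompt based on the best image so far.
            current_prompt = refine(current_prompt, best_image)

        # Score every image in the batch and keep the best one seen so far.
        for image in generate(current_prompt):
            score = cosine_similarity(embed_image(image), target)
            if score > best_score:
                best_image, best_score = image, score

        if best_score >= threshold:
            break

    return best_image, best_score
```

In this sketch the best image is tracked across all rounds, so a later, worse batch cannot replace an earlier, better result. Whether the original pipeline keeps such a running best is not stated in the abstract.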
