Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models
Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang
TL;DR
This paper introduces Interactive Text to Image (iT2I), a training-free framework that augments existing large language models (LLMs) to generate and refine images via natural language dialogue. It proposes Mini-DALLE3, a two-stage architecture (router and adapter) that converts multi-turn conversations into intermediate textual image descriptions and leverages off-the-shelf T2I models, guided by prompt refinement and hierarchical content control. Evaluations across diverse LLMs show that iT2I can be added to existing systems with minimal impact on core LLM tasks while enabling consistent, multi-turn image generation and editing. The work aims to improve both the user experience and image fidelity in future T2I systems, providing a scalable path toward human-machine collaboration in visual content creation.
Abstract
The revolution in AI-generated content has been rapidly accelerated by the boom in text-to-image (T2I) diffusion models. Within just two years of development, state-of-the-art models have reached unprecedented quality, diversity, and creativity in the images they generate. However, a prevalent limitation persists: it remains difficult to communicate effectively with popular T2I models, such as Stable Diffusion, using natural language descriptions. As a result, an engaging image is typically hard to obtain without expertise in prompt engineering, which involves complex word compositions, magic tags, and annotations. Inspired by the recently released DALLE3, a T2I model built directly into ChatGPT that speaks human language, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task, interactive text to image (iT2I), in which people interact with an LLM using natural language for interleaved high-quality image generation, editing, and refinement, as well as question answering, with stronger image-text correspondence. To address the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach in a variety of commonly used scenarios under different LLMs, e.g., ChatGPT, LLAMA, Baichuan, and InternLM. We demonstrate that our approach can be a convenient and low-cost way to introduce the iT2I ability to any existing LLM and any text-to-image model without any training, while causing little degradation of the LLM's inherent capabilities in, e.g., question answering and code generation. We hope this work will draw broader attention and provide inspiration for improving the user experience of human-machine interaction, alongside the image quality, of next-generation T2I systems.
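As an illustration of the router-and-adapter design summarized in the TL;DR, the following is a minimal sketch, not the paper's implementation. It assumes the LLM has been prompted to wrap any image description in `<image>...</image>` tags; the tag format, the names `route`, `adapt`, and `respond`, the default style tags, and the `text_to_image` callable are all hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Union

# Assumed convention: the prompted LLM wraps image descriptions in these tags.
IMAGE_TAG = re.compile(r"<image>(.*?)</image>", re.DOTALL)


@dataclass
class ImageRequest:
    """An intermediate textual image description extracted from an LLM reply."""
    description: str


Segment = Union[str, ImageRequest]


def route(llm_reply: str) -> List[Segment]:
    """Router: split a reply into plain-text segments and image requests."""
    segments: List[Segment] = []
    cursor = 0
    for match in IMAGE_TAG.finditer(llm_reply):
        text = llm_reply[cursor:match.start()].strip()
        if text:
            segments.append(text)
        segments.append(ImageRequest(match.group(1).strip()))
        cursor = match.end()
    tail = llm_reply[cursor:].strip()
    if tail:
        segments.append(tail)
    return segments


def adapt(request: ImageRequest, style_tags: str = "high quality, detailed") -> str:
    """Adapter: turn a description into a prompt for a T2I backend.

    Appending fixed style tags is a stand-in for prompt refinement.
    """
    return f"{request.description}, {style_tags}"


def respond(llm_reply: str, text_to_image: Callable[[str], str]) -> List[str]:
    """Produce an interleaved list of text and image outputs for one reply.

    `text_to_image` is any callable mapping a prompt to an image handle
    (for example a file path); here it is a hypothetical placeholder.
    """
    outputs: List[str] = []
    for segment in route(llm_reply):
        if isinstance(segment, ImageRequest):
            outputs.append(text_to_image(adapt(segment)))
        else:
            outputs.append(segment)
    return outputs
```

In this sketch, a reply without tags passes through unchanged, which is consistent with the claim that ordinary LLM tasks are largely unaffected; a real system would replace the stub `text_to_image` with a call to an actual T2I model.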
