Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang

TL;DR

This paper introduces Interactive Text to Image (iT2I), a training-free framework that augments existing large language models (LLMs) to generate and refine images via natural language dialogue. It proposes Mini-DALLE3, a two-stage architecture (router and adapter) that converts multi-turn conversations into intermediate textual image descriptions and leverages off-the-shelf T2I models, guided by prompt refinement and hierarchical content control. Evaluations across diverse LLMs show that iT2I can be added to existing systems with minimal impact on core LLM tasks while enabling consistent, multi-turn image generation and editing. The work aims to improve both the user experience and image fidelity in future T2I systems, providing a scalable path toward human-machine collaboration in visual content creation.
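
To make the router/adapter split above more concrete, here is a minimal sketch of how such a two-stage pipeline could be wired together. It is not the authors' implementation. The `<image>...</image>` tag convention, the names `route`, `adapt`, and `render`, the `style_suffix` parameter, and the `t2i` callable are all assumptions made for illustration; a real system would pass in an actual text-to-image model in place of the stub.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Union

# Assumption: the prompted LLM wraps image descriptions in <image>...</image>.
IMAGE_TAG = re.compile(r"<image>(.*?)</image>", re.DOTALL)


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageRequest:
    description: str


Segment = Union[TextSegment, ImageRequest]


def route(llm_response: str) -> List[Segment]:
    """Router (hypothetical): split an LLM reply into text and image requests."""
    segments: List[Segment] = []
    cursor = 0
    for match in IMAGE_TAG.finditer(llm_response):
        before = llm_response[cursor:match.start()].strip()
        if before:
            segments.append(TextSegment(before))
        segments.append(ImageRequest(match.group(1).strip()))
        cursor = match.end()
    tail = llm_response[cursor:].strip()
    if tail:
        segments.append(TextSegment(tail))
    return segments


def adapt(request: ImageRequest, style_suffix: str = "high quality, detailed") -> str:
    """Adapter (hypothetical): turn a description into a prompt for a T2I model."""
    return f"{request.description}, {style_suffix}"


def render(llm_response: str, t2i: Callable[[str], object]) -> List[object]:
    """Run the router, then send each image request through the adapter and T2I model."""
    output: List[object] = []
    for segment in route(llm_response):
        if isinstance(segment, ImageRequest):
            output.append(t2i(adapt(segment)))
        else:
            output.append(segment.text)
    return output


if __name__ == "__main__":
    reply = "Sure! <image>a corgi wearing sunglasses</image> Anything else?"
    # Stub T2I model for illustration; it only echoes the prompt it receives.
    print(render(reply, lambda prompt: f"[image: {prompt}]"))
```

Because the router only inspects the text of the LLM reply, a sketch like this needs no training and leaves replies without image tags untouched, which matches the training-free framing in the summary above.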

Abstract

The revolution in artificial intelligence content generation has been rapidly accelerated by booming text-to-image (T2I) diffusion models. Within just two years of development, state-of-the-art models have come to generate images of unprecedented quality, diversity, and creativity. However, a prevalent limitation persists in communicating effectively with these popular T2I models, such as Stable Diffusion, using natural language descriptions. This typically makes an engaging image hard to obtain without expertise in prompt engineering, with its complex word compositions, magic tags, and annotations. Inspired by the recently released DALLE3, a T2I model built directly into ChatGPT that talks human language, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task, interactive text to image (iT2I), where people can interact with an LLM for interleaved high-quality image generation/editing/refinement and question answering, with stronger image-text correspondence, using natural language. To address the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach for iT2I in a variety of commonly used scenarios under different LLMs, e.g., ChatGPT, LLAMA, Baichuan, and InternLM. We demonstrate that our approach can be a convenient and low-cost way to introduce the iT2I ability to any existing LLM and any text-to-image model without any training, while causing little degradation of the LLMs' inherent capabilities in, e.g., question answering and code generation. We hope this work can draw broader attention and provide inspiration for improving the user experience in human-machine interaction alongside the image quality of next-generation T2I systems.

Paper Structure (15 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Examples of two interactive text-to-image conversations produced by Mini DALL·E 3. In these cases, people can ask the agent to generate images via natural language and request an edit if the results are unsatisfactory. The generation and editing can be completed in a multi-turn dialog with recognition of the conversation context.
  • Figure 2: The evolution of image generation systems.
  • Figure 3: Illustrations of different human-machine interfaces for T2I systems.
  • Figure 4: Illustration of 6 types of interactions in interactive text-to-image workflow.
  • Figure 5: Pipeline Overview. Mini-DALLE3 consists of two stages, with 1) a router that analyzes the response from the prompted/finetuned LLM and dispatches the demand for image generation if needed, and 2) an adapter that transforms the image embedding or descriptions for subsequent T2I models.
  • ...and 3 more figures