DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, Wei Liu

TL;DR

DialogGen tackles multi-turn, multi-modal text-to-image generation by aligning off-the-shelf MLLMs with T2I diffusion models through drawing prompt alignment, bilingual instruction tuning, and an error-correction framework. It introduces DialogBen, a bilingual benchmark with modality-switching and coherence metrics for evaluating MIDS performance. Empirical results show that DialogGen outperforms state-of-the-art baselines on modality-switching accuracy and generation coherence across multiple T2I backbones and languages. The work also provides an evaluation protocol and dataset resources to support fair assessment of multi-turn, multi-modal image generation systems.

Abstract

Text-to-image (T2I) generation models have advanced significantly in recent years. However, interacting with these models effectively is challenging for average users, because it requires specialized prompt-engineering knowledge and because the models cannot perform multi-turn image generation, which hinders a dynamic and iterative creation process. Recent attempts have equipped Multi-modal Large Language Models (MLLMs) with T2I models to turn users' natural-language instructions into images. This extends the output modalities of MLLMs and improves the multi-turn generation quality of T2I models, thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works struggle to identify the correct output modality and to generate coherent images as the number of output modalities increases and conversations grow longer. Therefore, we propose DialogGen, an effective pipeline that aligns off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn text-to-image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics, which measure the model's ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and a user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.
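As a rough illustration of what a modality-switching metric could look like, the sketch below scores a system by the fraction of dialogue turns in which its output modality matches the expected one. This is an assumption made for illustration only: the abstract does not give the exact definition used in DialogBen, and the function name `modality_switch_accuracy` and the labels `"text"` and `"image"` are hypothetical.

```python
from typing import Sequence


def modality_switch_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of turns whose predicted output modality matches the expected one.

    Assumption: plain per-turn accuracy over modality labels such as "text"
    or "image". The benchmark's actual metric may be defined differently.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one turn is required")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical three-turn dialogue: the system answers the last turn with an
# image although a text reply was expected.
score = modality_switch_accuracy(
    predicted=["image", "text", "image"],
    expected=["image", "text", "text"],
)
```

Here two of the three turns match, so `score` is 2/3.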

Paper Structure (19 sections, 6 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Illustration of the Multi-modal Interactive Dialogue System (MIDS) built with our proposed DialogGen. It performs multi-turn, multi-modal tasks in response to users' natural-language instructions, covering image generation, image editing, and chatting.
  • Figure 2: Our benchmark encompasses 7 edit instruction types and 13 topic types.
  • Figure 3: Overview of MIDS and DialogBen. MIDS can respond to the multi-modal user instructions with either a text response or a drawing prompt to be sent to a T2I model for image generation. DialogBen consists of 9957 three-turn multi-modal dialogs and two evaluation metrics to assess the capability of MIDS.
  • Figure 4: The overall pipeline of DialogGen, which consists of Drawing Prompt Alignment, Training Data Curation, and Error Correction. In Drawing Prompt Alignment, re-captioning is performed on $D_G$ to ensure alignment between the transformed prompts and the T2I model. We then carefully curate the training data by adding an object consistency guarantee, bilingual data, and mixed instruction tuning data. Finally, we apply an error correction mechanism to the student model $M_s$ so that the model learns from its mistakes.
  • Figure 5: Visualization of the outputs of NexTGPT, DialogGen-SD, and DialogGen-Hunyuan on the DialogBen benchmark. DialogGen more often produces output in the correct modality and achieves higher semantic coherence.
  • ...and 3 more figures
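
The behaviour described in the Figure 3 caption, where the MIDS answers each instruction with either a text response or a drawing prompt that is forwarded to a T2I model, can be sketched as a small routing function. This is a minimal sketch under stated assumptions, not the authors' implementation: `MLLMOutput`, `Reply`, `respond`, `stub_mllm`, and `stub_t2i` are hypothetical names, and the stubs stand in for a real MLLM and a real T2I model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MLLMOutput:
    modality: str  # "text" or "image" (hypothetical labels)
    content: str   # a text reply, or a drawing prompt when modality == "image"


@dataclass
class Reply:
    modality: str
    text: str
    image: Optional[bytes] = None


def respond(
    history: List[str],
    mllm: Callable[[List[str]], MLLMOutput],
    t2i: Callable[[str], bytes],
) -> Reply:
    """Route one dialogue turn: text replies pass through, drawing prompts go to T2I."""
    out = mllm(history)
    if out.modality == "image":
        # The MLLM produced a drawing prompt; hand it to the T2I model.
        return Reply("image", out.content, t2i(out.content))
    if out.modality == "text":
        return Reply("text", out.content)
    raise ValueError(f"unknown output modality: {out.modality!r}")


def stub_mllm(history: List[str]) -> MLLMOutput:
    # Stand-in for a real MLLM: a keyword check on the latest user message.
    last = history[-1]
    if "draw" in last.lower():
        return MLLMOutput("image", "a detailed drawing prompt for: " + last)
    return MLLMOutput("text", "Sure, happy to chat.")


def stub_t2i(prompt: str) -> bytes:
    # Stand-in for a real T2I model: returns the prompt bytes instead of pixels.
    return prompt.encode("utf-8")


image_reply = respond(["Draw a cat"], stub_mllm, stub_t2i)
text_reply = respond(["Hello"], stub_mllm, stub_t2i)
```

Passing the MLLM and the T2I model in as callables keeps the routing logic independent of any particular backbone, so the same function works whichever T2I model sits behind `t2i`.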