DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation

Jiapeng Wang, Chengyu Wang, Tingfeng Cao, Jun Huang, Lianwen Jin

TL;DR

DiffChat tackles the challenge of enabling non-experts to create high-quality images by learning to prompt diffusion-based text-to-image synthesis systems through chat. It assembles InstructPE, a large instruction-following prompt-engineering dataset, and trains a decoder-only LLM via supervised fine-tuning, followed by an enhanced PPO-based reinforcement learning framework guided by aesthetics, user preference, and content integrity. Key innovations include Action-space Dynamic Modification for sampling and Content Integrity-aware value estimation, which improve positive sample quality and state evaluation. Experiments across multiple diffusion models and human studies show that DiffChat outperforms strong baselines and generalizes across TIS variants, enabling practical interactive image creation with reduced prompt engineering effort.

Abstract

We present DiffChat, a novel method to align Large Language Models (LLMs) to "chat" with prompt-as-input Text-to-Image Synthesis (TIS) models (e.g., Stable Diffusion) for interactive image creation. Given a raw prompt/image and a user-specified instruction, DiffChat makes appropriate modifications and generates the target prompt, which can then be used to create a high-quality target image. To achieve this, we first collect an instruction-following prompt engineering dataset named InstructPE for the supervised training of DiffChat. Next, we propose a reinforcement learning framework that uses feedback on three core criteria for image creation, i.e., aesthetics, user preference, and content integrity. It involves an action-space dynamic modification technique to obtain more relevant positive samples and harder negative samples during off-policy sampling. Content integrity is also introduced into the value estimation function to further improve the produced images. Our method outperforms baseline models and strong competitors in both automatic and human evaluations, demonstrating its effectiveness.
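
As a rough illustration of the three-criteria feedback described in the abstract, the sketch below folds aesthetics, user preference, and content integrity scores into one scalar reward. It is not the authors' implementation: the abstract does not say how the criteria are combined, so the normalized weighted mean, the `Scorer` signature, the `RewardWeights` class, and the word-overlap `toy_integrity` scorer are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical scorer signature (an assumption, not from the paper):
# (target_prompt, raw_prompt, instruction) -> score in [0, 1].
Scorer = Callable[[str, str, str], float]


@dataclass(frozen=True)
class RewardWeights:
    """Illustrative weights for the three criteria; values are assumptions."""
    aesthetics: float = 1.0
    preference: float = 1.0
    integrity: float = 1.0


def toy_integrity(target_prompt: str, raw_prompt: str, instruction: str) -> float:
    """Toy stand-in for content integrity: the fraction of distinct
    raw-prompt words that still appear in the target prompt."""
    raw_words = set(re.findall(r"\w+", raw_prompt.lower()))
    target_words = set(re.findall(r"\w+", target_prompt.lower()))
    if not raw_words:
        return 1.0  # nothing to preserve
    return len(raw_words & target_words) / len(raw_words)


def combined_reward(
    target_prompt: str,
    raw_prompt: str,
    instruction: str,
    aesthetics_fn: Scorer,
    preference_fn: Scorer,
    integrity_fn: Scorer,
    weights: Optional[RewardWeights] = None,
) -> float:
    """Normalized weighted mean of the three scores (assumed combination rule)."""
    w = weights or RewardWeights()
    total = w.aesthetics + w.preference + w.integrity
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = (
        w.aesthetics * aesthetics_fn(target_prompt, raw_prompt, instruction)
        + w.preference * preference_fn(target_prompt, raw_prompt, instruction)
        + w.integrity * integrity_fn(target_prompt, raw_prompt, instruction)
    )
    return score / total
```

In practice the aesthetics and preference scorers would be learned models applied to images generated from the target prompt; here they are left as pluggable callables so the combination step can be read on its own.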

Paper Structure (31 sections, 6 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: (a) The pipeline of our DiffChat collaborating with off-the-shelf TIS models for interactive image iteration. (b) A simple example of DiffChat following instructions to interact with TIS models (Stable Diffusion XL here) for interactive image creation. Note that DiffChat is capable of automatic prompt refinement and re-writing through "chats" and can be applied to a variety of TIS models.
  • Figure 2: Data collection process of InstructPE.
  • Figure 3: The training procedure of DiffChat.
  • Figure 4: Qualitative results of InstructPix2Pix and DiffChat + SD for instruction-following image creation.
  • Figure 5: Results of human preference evaluation (i.e., Win/Tie/Lose rates of our method against others). IP2P is short for InstructPix2Pix.
  • ...and 7 more figures