TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

TL;DR

TheaterGen addresses semantic and contextual consistency in multi-turn image generation through a training-free collaboration between an LLM and diffusion-based T2I models. It has three stages: an LLM maintains a structured prompt book of per-character prompts and layouts (Screenwriter); a per-turn step generates character images and extracts guidance from them (Rehearsal); and a final generation step combines the prompt book with latent and lineart guidance (Final Performance). A new benchmark, CMIGBench, provides 8000 multi-turn instructions with no pre-defined characters and covers both story generation and multi-turn editing. On it, TheaterGen outperforms state-of-the-art baselines, improving on Mini DALLE 3 by 21% in average character-character similarity and 19% in average text-image similarity. The results indicate that LLM-driven prompt management and guided diffusion can achieve semantic and contextual coherence in multi-turn synthesis without task-specific training, which is relevant to interactive storytelling and editing workflows.

Abstract

Recent advances in diffusion models can generate high-quality and stunning images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to the "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, akin to conducting the "Final Performance". With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Unlike previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both story generation and multi-turn editing tasks are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. It raises the performance bar of the cutting-edge Mini DALLE 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
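
The prompt book described in the abstract can be pictured as a small per-character record that persists across turns. The following is a minimal sketch under stated assumptions: the class names, field names, and the `apply_turn` method are hypothetical and not taken from the paper, and the per-turn updates that an LLM would produce are passed in by hand here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical schema: all names below are illustrative assumptions.
BBox = Tuple[float, float, float, float]  # (x, y, width, height), normalized to [0, 1]


@dataclass
class CharacterEntry:
    char_id: str   # stable identifier, reused across turns for the same character
    prompt: str    # text prompt describing this character
    bbox: BBox     # layout box for this character in the target image


@dataclass
class PromptBook:
    background: str = ""
    characters: Dict[str, CharacterEntry] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def apply_turn(
        self,
        instruction: str,
        updates: List[CharacterEntry],
        background: Optional[str] = None,
    ) -> None:
        """Record one interaction turn.

        In a full system an LLM would derive `updates` from `instruction`;
        this sketch takes them as given.
        """
        self.history.append(instruction)
        if background is not None:
            self.background = background
        for entry in updates:
            # Reusing a char_id overwrites the entry, so a character keeps
            # one identity while its prompt or layout changes.
            self.characters[entry.char_id] = entry


book = PromptBook()
book.apply_turn(
    "A corgi sits on a beach.",
    [CharacterEntry("corgi", "a fluffy corgi", (0.1, 0.5, 0.3, 0.4))],
    background="a sunny beach",
)
book.apply_turn(
    "Add a red kite in the sky.",
    [CharacterEntry("kite", "a red kite", (0.6, 0.1, 0.2, 0.2))],
)
book.apply_turn(
    "Move the corgi to the right.",
    [CharacterEntry("corgi", "a fluffy corgi", (0.6, 0.5, 0.3, 0.4))],
)
```

Keying entries by a stable identifier is one simple way to keep a character's description fixed across turns while only its layout changes, which is the kind of bookkeeping the "Screenwriter" role implies.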

Paper Structure (33 sections, 6 equations, 24 figures, 4 tables)

Figures (24)

  • Figure 1: Visual comparison between Mini DALL·E 3 and our proposed TheaterGen in multi-turn story generation and multi-turn editing.
  • Figure 2: The overall structure of TheaterGen. TheaterGen utilizes three key components to generate an image in each interaction turn: (a) an LLM-based character designer that interacts with the user and maintains a structured prompt book for all character prompts and layouts, which serves as the "screenwriter"; (b) a character image manager for "rehearsal", which generates character images and extracts guidance based on the prompt book; (c) a character-guided generator that conducts the "final performance", i.e., generates the final image for the current turn by combining the prompt book and guidance information.
  • Figure 3: The proposed guidance extractor. It first extracts subjects from character images and rearranges them into the same image according to the layout. Then the lineart guidance and the latent guidance for subsequent image generation are obtained via a lineart processor and the forward diffusion process, respectively.
  • Figure 4: Ablation study on the effects of lineart and latent guidance.
  • Figure 5: Ablation results of two guidance approaches with consistency metrics.
  • ...and 19 more figures
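
The two steps named in the Figure 3 caption, rearranging subjects according to the layout and applying the forward diffusion process, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names are hypothetical, images are plain grayscale grids rather than encoded latents, and the linear beta schedule is the common DDPM default rather than a setting confirmed by this page. The lineart branch is omitted.

```python
import math
import random
from typing import List, Tuple

Image = List[List[float]]  # toy grayscale image, row-major


def compose(canvas_size: Tuple[int, int],
            subjects: List[Tuple[Image, Tuple[int, int]]]) -> Image:
    """Paste each subject crop onto a blank canvas at its (top, left) layout position."""
    height, width = canvas_size
    canvas = [[0.0] * width for _ in range(height)]
    for crop, (top, left) in subjects:
        for r, row in enumerate(crop):
            for c, value in enumerate(row):
                if 0 <= top + r < height and 0 <= left + c < width:
                    canvas[top + r][left + c] = value
    return canvas


def alpha_bar(t: int, num_steps: int = 1000,
              beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s < t under an assumed linear schedule."""
    product = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        product *= 1.0 - beta
    return product


def forward_diffuse(x0: Image, t: int, rng: random.Random) -> Image:
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1)."""
    a = alpha_bar(t)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in x0]


# Two toy "subjects" placed by layout, then noised to an intermediate timestep.
subject_a: Image = [[1.0, 1.0], [1.0, 1.0]]
subject_b: Image = [[0.5]]
canvas = compose((4, 4), [(subject_a, (1, 1)), (subject_b, (0, 3))])
noisy = forward_diffuse(canvas, t=500, rng=random.Random(0))
```

Noising the composed image to an intermediate timestep, rather than starting from pure noise, keeps a coarse trace of where each subject sits, which is the intuition behind using it as guidance for the subsequent reverse denoising.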