AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

TL;DR

AutoStudio addresses the difficulty of maintaining subject consistency in multi-turn interactive image generation with a training-free multi-agent system that coordinates subject management, layout refinement, and diffusion-based rendering. It combines three LLM-based agents with a Stable Diffusion-based drawer, and adds a Parallel-UNet and a subject-initialized generation method to preserve multi-subject fidelity. On CMIGBench, the method reports state-of-the-art results on the quantitative metrics (aFID and aCCS) and in human evaluations, and ablation studies support the contributions of supervision, dual cross-attention, and subject-guided initialization. The work points to practical uses in interactive storytelling, manga generation, and open-ended editing, while acknowledging computational and safety considerations for user-driven content.

Abstract

As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation, is beginning to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a Stable Diffusion (SD) based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. AutoStudio can thereby generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency well across multiple turns, and that it improves on the state of the art by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
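
The four-agent flow described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: every function below (`toy_manager`, `toy_layout`, `toy_supervisor`, `toy_drawer`, `run_dialogue`) is a hypothetical stand-in. In AutoStudio the first three roles are played by LLM-based agents and the drawer is Stable Diffusion-based. The sketch only shows how a subject database shared across turns lets a recurring subject reuse the reference stored when it first appeared.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass
class Subject:
    subject_id: str
    description: str
    reference: Optional[str] = None  # stand-in for stored subject features or an image


@dataclass
class SubjectDatabase:
    subjects: Dict[str, Subject] = field(default_factory=dict)

    def upsert(self, subject: Subject) -> Subject:
        # Return the stored entry if the subject is already known, so a
        # recurring subject keeps the reference from its first appearance.
        return self.subjects.setdefault(subject.subject_id, subject)


def toy_manager(instruction: str, db: SubjectDatabase) -> List[Subject]:
    # Hypothetical stand-in for the subject manager: keyword spotting only.
    known = ("cat", "dog", "robot")
    found = [Subject(name, f"a {name}") for name in known if name in instruction.lower()]
    return [db.upsert(s) for s in found]


def toy_layout(subjects: List[Subject]) -> Dict[str, Box]:
    # Hypothetical stand-in for the layout generator: equal-width vertical strips.
    n = max(len(subjects), 1)
    return {s.subject_id: (i / n, 0.2, (i + 1) / n, 0.9) for i, s in enumerate(subjects)}


def toy_supervisor(layout: Dict[str, Box], margin: float = 0.02) -> Dict[str, Box]:
    # Hypothetical stand-in for the supervisor: shrink boxes so neighbours do not touch.
    return {sid: (x0 + margin, y0, x1 - margin, y1) for sid, (x0, y0, x1, y1) in layout.items()}


def toy_drawer(instruction: str, layout: Dict[str, Box], db: SubjectDatabase, turn: int) -> dict:
    # Hypothetical stand-in for the drawer: record a reference the first time a
    # subject is drawn, then reuse it on later turns.
    for sid in layout:
        if db.subjects[sid].reference is None:
            db.subjects[sid].reference = f"{sid}@turn{turn}"
    refs = {sid: db.subjects[sid].reference for sid in layout}
    return {"turn": turn, "prompt": instruction, "layout": layout, "references": refs}


def run_dialogue(instructions: List[str]) -> List[dict]:
    db = SubjectDatabase()  # shared across all turns
    frames = []
    for turn, instruction in enumerate(instructions, start=1):
        subjects = toy_manager(instruction, db)
        layout = toy_supervisor(toy_layout(subjects))
        frames.append(toy_drawer(instruction, layout, db, turn))
    return frames


if __name__ == "__main__":
    for frame in run_dialogue(["A cat sits on a sofa", "The cat meets a dog"]):
        print(frame["turn"], frame["references"])
```

In this toy run, the cat drawn in the second turn still points at the reference recorded in the first turn, while the dog receives a new one. That reuse of stored subject information is the consistency property the subject database is meant to support.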

Paper Structure

This paper contains 33 sections, 15 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Two comic books generated by AutoStudio.
  • Figure 2: Visual examples of multi-turn interactive image generation tasks that can be achieved by AutoStudio while remaining challenging for other cutting-edge methods.
  • Figure 3: Architecture comparison between AutoStudio (f) and other models, including (a) AutoStory, (b) StoryDiffusion, (c) Mini-Gemini, (d) Mini DaLLE·3, and (e) TheaterGen.
  • Figure 4: Overall structure of AutoStudio. AutoStudio leverages four agents and a subject database to complete multi-turn multi-subject interactive image generation: (i) A subject manager interprets user dialogues; (ii) A layout generator provides layouts; (iii) A supervisor provides suggestions for layout refinement; (iv) A drawer generates images given refined layouts and the subject database.
  • Figure 5: Overall structure of our subject-initialized generation method.
  • ...and 12 more figures