Manga Generation via Layout-controllable Diffusion

Siyu Chen, Dengjie Li, Zenghao Bao, Yao Zhou, Lingfeng Tan, Yujie Zhong, Zheng Zhao

TL;DR

The paper tackles generating multi-panel manga pages from plain text by introducing Manga109Story as a captioned, story-aligned extension of Manga109 and a diffusion-based generator, MangaDiffusion, that models intra-panel and inter-panel interactions. It uses a two-step pipeline: segmenting an input story with an LLM into per-panel scripts and then generating panels with a Transformer-based diffusion model while masking speech bubbles to reduce clutter. The approach achieves controlled panel counts and diverse, coherent layouts, demonstrating strong quantitative results (FID and CLIP-I) and qualitative layout consistency, while noting data limitations and room for improvement in cross-panel coherence and character consistency. The work provides a practical path to convert textual narratives into engaging manga content and offers datasets and architectural insights for future manga generation research.

Abstract

Generating comics from text has been widely studied. However, few studies address generating multi-panel manga (Japanese comics) from plain text alone. A manga page contains multiple panels and is characterized by coherent storytelling, reasonable and diverse page layouts, consistent characters, and semantic correspondence between panel drawings and panel scripts. Generating manga is therefore a significant challenge. This paper presents the manga generation task and constructs the Manga109Story dataset for studying manga generation from plain text alone. Additionally, we propose MangaDiffusion to facilitate intra-panel and inter-panel information interaction during the manga generation process. The results show that our method ensures the correct number of panels and produces reasonable and diverse page layouts. Our approach has the potential to convert large amounts of textual stories into more engaging manga, which suggests significant application prospects.

Paper Structure (25 sections, 1 equation, 8 figures, 1 table)


Figures (8)

  • Figure 1: Difference between the story generation task and the manga generation task. (a) Prompts are fed into the model one at a time. Each image is generated independently, so there is no need to consider the reading order or layout of the images. (b) Prompts are fed into the model as a batch, and the model processes all inputs at once. Each input controls the content of one panel, while the model controls the overall order and layout of the panels, ultimately outputting a manga page with multiple panels.
  • Figure 2: The construction process of the Manga109Story dataset. The Manga109 dataset includes basic information such as the coordinates of panels, characters, faces, and text, as well as the text content. The Manga109Dialog dataset associates dialogues with their respective speakers. We use a panel order estimator to predict the panel order of each manga page. We combine this information into an XML file and input it, together with the original manga page, into the MLLM for captioning. Ultimately, we obtain a caption for each panel and a story that summarizes the entire page.
  • Figure 3: The entire pipeline of our proposed manga generation method. Users input a plain text story, and with the help of an LLM, we plan the story and divide it into $K$ scripts. If $K$ is smaller than the maximum supported number of panels, the remaining scripts are filled with EMPTY. The scripts and randomly sampled Gaussian noise are then fed into our proposed MangaDiffusion model for manga generation, resulting in $K$ ordered panels. By taking the pixel-wise minimum over the panels, a manga page is synthesized.
  • Figure 4: Architecture of MangaDiffusion. During the training stage, we split a complete manga page into panel images. A padding image is added if the number of panels is less than the maximum supported number. These panel images are fed as a batch into a pretrained VAE to obtain their latent representations. Each panel image has a corresponding caption that controls its content generation. The transformer block consists of an intra-panel block and an inter-panel block for information interaction. The caption participates only in the computation within the intra-panel block. Timestep $t$ is injected into the model using adaLN-single [chen2023pixart]. The intra-panel mask is used to remove text and speech bubble boxes within the image, while the inter-panel mask is used to mask out the padding images.
  • Figure 5: Illustration of intra-panel mask. The first row represents the original panel, and the second row represents the corresponding intra-panel mask. The tokens in the white region are involved in attention calculation and loss calculation, while the black regions are not involved.
  • ...and 3 more figures
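
The Figure 3 and Figure 4 captions describe three small mechanical steps: padding the script list with EMPTY up to the maximum panel count, masking out the padding panels, and merging panels into a page by a pixel-wise minimum. The following is a minimal sketch of those steps in plain Python, not the authors' implementation. The function names, the `WHITE` background value, and the assumption that each generated panel arrives as a page-sized grayscale canvas that is white outside the panel region are all hypothetical choices made for illustration. The boolean list is also a simplification of the inter-panel mask, which in the paper operates inside the model.

```python
from typing import List, Sequence

EMPTY = "EMPTY"  # placeholder script named in the Figure 3 caption
WHITE = 255      # assumed background value for 8-bit grayscale canvases

Canvas = List[List[int]]  # hypothetical type: rows of pixel intensities


def pad_scripts(scripts: Sequence[str], max_panels: int) -> List[str]:
    """Pad K scripts with EMPTY up to the maximum supported panel count."""
    if len(scripts) > max_panels:
        raise ValueError("more scripts than supported panels")
    return list(scripts) + [EMPTY] * (max_panels - len(scripts))


def inter_panel_mask(scripts: Sequence[str]) -> List[bool]:
    """Return True for real panels and False for padding panels."""
    return [script != EMPTY for script in scripts]


def compose_page(canvases: Sequence[Canvas]) -> Canvas:
    """Merge page-sized panel canvases by taking the pixel-wise minimum."""
    height, width = len(canvases[0]), len(canvases[0][0])
    page = [[WHITE] * width for _ in range(height)]
    for canvas in canvases:
        for y in range(height):
            for x in range(width):
                # Darker (inked) pixels win over the white background.
                page[y][x] = min(page[y][x], canvas[y][x])
    return page


# Usage: two scripts padded to four slots, then two tiny 2x2 canvases merged.
scripts = pad_scripts(["A girl opens the door.", "She gasps."], max_panels=4)
keep = inter_panel_mask(scripts)          # [True, True, False, False]
top_panel = [[0, 0], [255, 255]]          # ink only in the top row
bottom_panel = [[255, 255], [10, 20]]     # ink only in the bottom row
page = compose_page([top_panel, bottom_panel])  # [[0, 0], [10, 20]]
```

Under the white-background assumption, the minimum leaves each panel's pixels untouched wherever the other canvases are blank, so panels that occupy disjoint regions combine into one page without any explicit pasting step.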