DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models

Patrick Kwon, Chen Chen

TL;DR

DreamingComics tackles layout-aware story visualization by jointly modeling panel and character placement with an LLM-based layout generator and a Dream-Illustrator built on a pretrained video diffusion-transformer. It introduces RegionalRoPE for region-grounded latent positioning and a masked condition loss that constrains attention to designated regions, enabling multi-subject identity and style preservation within comic layouts. The approach is trained on a curated image-layout and text-layout dataset and leverages FramePack-based fast single-frame customization for efficiency. Compared to previous methods, it reports a 29.2% increase in character consistency and a 36.2% increase in style similarity, along with high spatial accuracy and favorable user studies.

Abstract

Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and a 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy. Our project page is available at https://yj7082126.github.io/dreamingcomics/
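
To make the re-indexing idea behind RegionalRoPE concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the `(top, left, bottom, right)` box convention in latent-token units, and the linear spreading of reference positions across the box are all assumptions chosen for illustration. The sketch only shows how reference tokens that would otherwise receive top-left indices could instead receive indices inside the target layout region.

```python
def default_indices(ref_h, ref_w):
    """Default 2D positions for a reference token grid: anchored at the top-left."""
    return [[(r, c) for c in range(ref_w)] for r in range(ref_h)]


def _linspace(start, stop, n):
    """n evenly spaced values from start to stop (inclusive)."""
    if n == 1:
        return [(start + stop) / 2.0]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def regional_indices(ref_h, ref_w, box):
    """Hypothetical re-indexing: spread reference positions over the layout box.

    box = (top, left, bottom, right) in latent-token units, with bottom and
    right exclusive. This convention is an assumption, not taken from the paper.
    """
    top, left, bottom, right = box
    if bottom <= top or right <= left:
        raise ValueError("box must have positive height and width")
    rows = _linspace(top, bottom - 1, ref_h)
    cols = _linspace(left, right - 1, ref_w)
    return [[(r, c) for c in cols] for r in rows]


if __name__ == "__main__":
    # A 2x2 reference grid placed into a 4x4 region of the target latent grid.
    print(default_indices(2, 2))
    print(regional_indices(2, 2, (4, 6, 8, 10)))
```

The resulting (row, column) pairs would then be fed to the rotary embedding in place of the default indices. How the actual method treats the temporal axis of the video model is not covered by this sketch.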

Paper Structure

This paper contains 14 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: (left) Overview of DreamingComics, a story visualization framework for multi subject and layout control. The dialogues are post-edited by humans. (right) Examples from DreamingComics, which is capable of generating layout-controlled stories with diverse art styles such as pencil illustrations, Disney-style animation, digital line-art, live-action drama, and cel animation.
  • Figure 2: Overview of our image customization pipeline. The input reference images are encoded as token sequences $c_{1:n}$ and passed, along with the noise latent $z_t$ and the text latent $z_p$, to the stream of diffusion transformer blocks. We calculate a custom regional RoPE from the layout condition and apply it to the encoded references. During training, we calculate a Masked Condition Loss between the cross-attention map and the given layout condition, encouraging the model to position references within the layout.
  • Figure 3: Given a list of textual descriptions for each panel, the finetuned LLM outputs a spatial layout for the panels and characters as a set of bounding boxes. Note that our layout, compared to the layout generated from the same prompt using GPT-4 (Achiam et al., 2023), occupies most of the panel region, correctly orders the panels (top-to-bottom, right-to-left), and draws plausible character boxes, constituting a "good comic layout".
  • Figure 4: For illustration, we visualize the original positional indices for RoPE at (a) and the new indices at (c). The blue square indicates the intended layout region, and the red square indicates the actual generated region. The original RoPE restricts the reference content to the top-left corner, while ours can correctly position it according to the given layout.
  • Figure 5: Comparison between the cross-attention maps (CAMs) of our model trained without the masked condition loss (top) and with the masked condition loss (bottom). The new loss concentrates attention around the target layout, which is evident in the third column (Transformer layer = 2), naturally inducing the model during training to position the character.
  • ...and 2 more figures
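
The masked condition loss mentioned in the Figure 2 and Figure 5 captions compares a cross-attention map against the layout condition. The exact formulation is not given on this page, so the following is only one plausible sketch: it assumes the loss is the fraction of attention mass that falls outside the subject's bounding box. The function names, the box convention `(top, left, bottom, right)` with exclusive bottom and right, and the `eps` stabilizer are all hypothetical.

```python
def box_mask(height, width, box):
    """Binary mask that is 1.0 inside the layout box and 0.0 elsewhere."""
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0 for c in range(width)]
        for r in range(height)
    ]


def masked_condition_loss(attn, mask, eps=1e-8):
    """Assumed form of the loss: share of attention mass outside the mask.

    attn and mask are 2D lists of equal shape; attn holds non-negative
    attention weights from one reference subject to the image tokens.
    """
    total = sum(sum(row) for row in attn)
    inside = sum(a * m for a_row, m_row in zip(attn, mask) for a, m in zip(a_row, m_row))
    return 1.0 - inside / (total + eps)


if __name__ == "__main__":
    mask = box_mask(2, 2, (0, 0, 2, 1))  # left column of a 2x2 grid
    print(masked_condition_loss([[1.0, 0.0], [1.0, 0.0]], mask))  # attention inside the box
    print(masked_condition_loss([[0.0, 1.0], [0.0, 1.0]], mask))  # attention outside the box
```

Under this assumed form, the loss approaches 0 when a subject's attention stays inside its box and approaches 1 when it lies entirely outside, which matches the qualitative behavior described for Figure 5.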