Masked Generative Story Transformer with Character Guidance and Caption Augmentation

Christos Papadimitriou, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou

TL;DR

This work tackles Story Visualization, where generating coherent image sequences that align with narrative captions is challenging because it demands both visual fidelity and temporal consistency. It introduces MaskGST, a MaskGIT-inspired Transformer that uses cross-attention to full story captions, alongside an image-agnostic LLM-based caption augmentation and a Character Guidance mechanism that combines text- and character-conditioned logits in the logit space. The approach yields state-of-the-art Char-F1, Char-Acc and BLEU-2/3 on Pororo-SV, with the best FID among Transformer-based SV methods, while remaining computationally efficient on a single 16 GB GPU. Human studies validate the quantitative gains, and ablations show the substantial contribution of caption augmentation and character guidance to overall quality. The results suggest a promising direction for efficient, character-aware SV and potential extensions to other generative tasks requiring concept-specific control.

Abstract

Story Visualization (SV) is a challenging generative vision task that requires both visual quality and consistency between different frames in generated image sequences. Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately to improve the rendering of characters. In contrast, we embrace a completely parallel transformer-based approach, relying exclusively on Cross-Attention with past and future captions to achieve consistency. Additionally, we propose a Character Guidance technique to focus on the generation of characters in an implicit manner, by forming a combination of text-conditional and character-conditional logits in the logit space. We also employ a caption-augmentation technique, carried out by a Large Language Model (LLM), to enhance the robustness of our approach. The combination of these methods culminates in state-of-the-art (SOTA) results over various metrics on the most prominent SV benchmark (Pororo-SV), attained with constrained resources and with lower computational complexity than prior art. The validity of our quantitative results is supported by a human survey.
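
To make the idea of combining logits concrete, here is a minimal sketch of one plausible form of such a combination. The abstract does not give the formula, so the linear rule below (borrowed from classifier-free guidance) is an assumption, as are the names `guided_logits`, `text_logits`, `char_logits` and `lam`; `lam` is meant to play the role of a guidance factor like the $\lambda$ mentioned for Figure 4.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(text_logits, char_logits, lam):
    """Hypothetical guidance rule (an assumption, not the paper's formula).

    Moves the text-conditional logits toward the character-conditional
    logits by a factor lam: lam = 0 keeps the text logits unchanged,
    lam = 1 returns the character logits, lam > 1 extrapolates past them.
    """
    if len(text_logits) != len(char_logits):
        raise ValueError("logit vectors must have the same length")
    return [t + lam * (c - t) for t, c in zip(text_logits, char_logits)]


# Toy vocabulary of two image tokens.
text_logits = [1.0, 2.0]   # conditioned on the caption only
char_logits = [3.0, 0.0]   # conditioned on the characters only
mixed = guided_logits(text_logits, char_logits, lam=0.5)
probs = softmax(mixed)     # distribution a sampler would draw tokens from
```

In this toy case the caption alone prefers the second token, while the guided logits prefer the first, which is how a guidance factor can shift sampling toward character-relevant tokens.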

Paper Structure (62 sections, 10 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: MaskGST's Transformer model
  • Figure 2: Main characters featured in Pororo-SV
  • Figure 3: Qualitative comparison between our model (MaskGST-CG$_{\pm}$ w/ aug. captions) and CMOTA (Ahn et al., 2023) across 4 story examples.
  • Figure 4: Comparison of Character Guidance Factor ($\lambda$) values for different evaluation metrics.
  • Figure 5: Example of caption augmentation using ChatGPT.
  • ...and 4 more figures