From Panels to Prose: Generating Literary Narratives from Comics

Ragav Sachdeva, Andrew Zisserman

TL;DR

This work tackles manga accessibility for visually impaired readers by converting visual pages into immersive prose. It introduces Magiv3, a unified model that handles panel/character/text/tail detection, associations, OCR, and character grounding within a single framework, paired with a page-to-prose pipeline that uses zero-shot VLMs and instruction-tuned LLMs to generate coherent narrative and style variants. A new dataset, PopCaptions, provides richly grounded, character-aware captions to benchmark manga understanding and grounding. Quantitative and qualitative evaluations across multiple datasets demonstrate improvements over prior methods in detection, association, and grounding, with prose generation achieving human-judge scores around the mid-3 range, highlighting practical potential and areas for improvement in crowded scenes and grounding accuracy. The work paves the way for accessible, high-quality manga narration and offers valuable resources for future multimodal storytelling research.

Abstract

Comics have long been a popular form of storytelling, offering visually engaging narratives that captivate audiences worldwide. However, the visual nature of comics presents a significant barrier for visually impaired readers, limiting their access to these engaging stories. In this work, we provide a pragmatic solution to this accessibility challenge by developing an automated system that generates text-based literary narratives from manga comics. Our approach aims to create evocative and immersive prose that not only conveys the original narrative but also captures the depth and complexity of characters, their interactions, and the vivid settings in which they reside. To this end, we make the following contributions: (1) We present a unified model, Magiv3, that excels at various functional tasks pertaining to comic understanding, such as localising panels, characters, texts, and speech-bubble tails, performing OCR, and grounding characters. (2) We release human-annotated captions for over 3300 Japanese comic panels, along with character grounding annotations, and benchmark large vision-language models in their ability to understand comic images. (3) Finally, we demonstrate how integrating large vision-language models with Magiv3 can generate seamless literary narratives that allow visually impaired audiences to engage with the depth and richness of comic storytelling.

Paper Structure

This paper contains 27 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Our approach to transforming comics into accessible narratives begins with generating transcripts (left), capturing dialogue. In the image, green boxes represent panels, blue boxes represent characters, red boxes represent text, and purple boxes represent speech-bubble tails. Solid lines indicate character clusters, while dashed lines show associations between dialogues and their speakers. This is followed by character-grounded panel captioning (top-right), where grounded phrases are colour-coded, and their corresponding predicted character boxes are overlaid on the panel images, adding descriptions and placing characters in context. Finally, these elements are combined into prose (bottom-right), creating a rich, immersive narrative for visually impaired readers. Images ©YamatoNoHane by Saki Kaori.
  • Figure 2: Overview of the 'Page to Prose' Pipeline. The stages of the pipeline are described in section \ref{sec:pipeline}. Magiv3 is described in section \ref{sec:magiv3}, and the Captioning and Prose Generation in section \ref{sec:text}.
  • Figure 3: The Magiv3 architecture and its three use cases. The input to the model is an image and prompt pair. The output is text-only (tokens) predicted autoregressively for each of the three use cases. Images: ©HanzaiKousyouninMinegishiEitarou by Ki Takashi.
  • Figure 4: Caption Comparison. We show the captions predicted by various vision-language models, both open-source and proprietary, on a manga panel. The mistakes are highlighted in red. All the models make mistakes to some degree, such as miscounting the number of characters (there are four; notice the face below the '!?' on the left), hallucinating objects (e.g. "cross", "notebook"), or incorrectly identifying the setting (which is an airplane). However, the general trend is that the captions get more accurate from left to right.
  • Figure 5: Character-Grounded Panel Captions. We show the captions predicted by GPT-4o-2024-08-06 on various manga panels and visualise the bounding boxes for characters grounded by our model (colour coded and numbered for visualisation only).
  • ...and 8 more figures
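
The transcript stage described in the Figure 1 caption (text boxes, character clusters, and dialogue-to-speaker associations combined into lines of dialogue) can be illustrated with a minimal sketch. This is not Magiv3's actual interface or output format: `PageAnnotations`, `build_transcript`, the field names, and the `(x1, y1, x2, y2)` box layout are all hypothetical, and the sketch assumes the text boxes are already sorted in reading order.

```python
from dataclasses import dataclass, field

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


@dataclass
class PageAnnotations:
    """Hypothetical container for one page's detections and associations."""
    panels: list[Box]
    characters: list[Box]
    texts: list[Box]
    ocr: list[str]                     # one OCR string per text box
    cluster_ids: list[int]             # one cluster id per character box
    text_to_character: dict[int, int]  # text index -> speaking character index
    cluster_names: dict[int, str] = field(default_factory=dict)


def build_transcript(page: PageAnnotations) -> list[str]:
    """Turn OCR text plus speaker associations into 'Speaker: line' strings.

    Assumes page.ocr is already in reading order.
    """
    lines = []
    for text_idx, utterance in enumerate(page.ocr):
        char_idx = page.text_to_character.get(text_idx)
        if char_idx is None:
            # No dashed-line association: the speaker is unknown.
            speaker = "Unknown"
        else:
            # Map the character box to its identity cluster.
            cluster = page.cluster_ids[char_idx]
            speaker = page.cluster_names.get(cluster, f"Character {cluster}")
        lines.append(f"{speaker}: {utterance}")
    return lines


# Toy example with made-up boxes and dialogue.
example_page = PageAnnotations(
    panels=[(0, 0, 100, 100)],
    characters=[(10, 10, 40, 90), (60, 10, 90, 90)],
    texts=[(5, 5, 30, 20), (70, 5, 95, 20), (40, 80, 60, 95)],
    ocr=["Hello", "Hi", "..."],
    cluster_ids=[0, 1],
    text_to_character={0: 0, 1: 1},
    cluster_names={0: "Aki"},
)
transcript = build_transcript(example_page)
```

Keeping cluster ids separate from display names means a character can be renamed once per cluster without touching the per-box associations; whether the paper's system stores things this way is not stated in this section.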