From Panels to Prose: Generating Literary Narratives from Comics
Ragav Sachdeva, Andrew Zisserman
TL;DR
This work tackles manga accessibility for visually impaired readers by converting visual pages into immersive prose. It introduces Magiv3, a unified model that handles detection of panels, characters, texts, and speech-bubble tails, their associations, OCR, and character grounding within a single framework. Magiv3 is paired with a page-to-prose pipeline that uses zero-shot VLMs and instruction-tuned LLMs to generate coherent narratives and style variants. A new dataset, PopCaptions, provides richly grounded, character-aware captions for benchmarking manga understanding and grounding. Quantitative and qualitative evaluations across multiple datasets show improvements over prior methods in detection, association, and grounding. Prose generation achieves human-judge scores around the mid-3 range, which indicates practical potential while leaving room for improvement in crowded scenes and in grounding accuracy. The work paves the way for accessible, high-quality manga narration and offers resources for future multimodal storytelling research.
Abstract
Comics have long been a popular form of storytelling, offering visually engaging narratives that captivate audiences worldwide. However, the visual nature of comics presents a significant barrier for visually impaired readers, limiting their access to these engaging stories. In this work, we provide a pragmatic solution to this accessibility challenge by developing an automated system that generates text-based literary narratives from manga comics. Our approach aims to create evocative and immersive prose that not only conveys the original narrative but also captures the depth and complexity of characters, their interactions, and the vivid settings in which they reside. To this end, we make the following contributions: (1) We present a unified model, Magiv3, that excels at various functional tasks pertaining to comic understanding, such as localising panels, characters, texts, and speech-bubble tails, performing OCR, and grounding characters. (2) We release human-annotated captions for over 3300 Japanese comic panels, along with character grounding annotations, and benchmark large vision-language models in their ability to understand comic images. (3) Finally, we demonstrate how integrating large vision-language models with Magiv3 can generate seamless literary narratives that allow visually impaired audiences to engage with the depth and richness of comic storytelling.
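As a rough illustration of how per-panel outputs might be handed to a language model for prose generation, the sketch below collects a caption and speaker-attributed dialogue for each panel into a single text prompt. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Utterance` and `Panel` types, the `build_prose_prompt` function, and the prompt wording are all hypothetical, and no detection, OCR, or model call is performed here.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    """One line of dialogue (hypothetical type, not from the paper)."""
    speaker: str  # character name, assumed to come from a speaker-association step
    text: str     # transcribed text, assumed to come from an OCR step


@dataclass
class Panel:
    """One panel in reading order (hypothetical type, not from the paper)."""
    caption: str  # scene description, assumed to come from a vision-language model
    dialogue: list[Utterance] = field(default_factory=list)


def build_prose_prompt(panels: list[Panel]) -> str:
    """Flatten panels into a plain-text prompt for a text-only language model."""
    lines = ["Rewrite the following comic page as literary prose.", ""]
    for index, panel in enumerate(panels, start=1):
        lines.append(f"Panel {index}: {panel.caption}")
        for utterance in panel.dialogue:
            # Indent dialogue under its panel so the structure stays readable.
            lines.append(f'  {utterance.speaker}: "{utterance.text}"')
    return "\n".join(lines)


if __name__ == "__main__":
    page = [
        Panel("A rooftop at dusk.", [Utterance("Aiko", "We made it.")]),
        Panel("Close-up of a smile."),
    ]
    print(build_prose_prompt(page))
```

Keeping the intermediate representation as plain text means any instruction-following language model could consume it, which is one plausible reason to separate the visual analysis stage from the prose-writing stage.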
