Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names
Ragav Sachdeva, Gyungin Shin, Andrew Zisserman
TL;DR
This work introduces Magiv2, a chapter-wide manga transcription system that produces transcripts with consistently named characters and improved speaker attribution, addressing key shortcomings of prior page-level approaches. It combines a graph-based detection and association framework, a constraint-optimization approach for chapter-wide character naming, and a four-step transcript generation process that prioritizes essential dialogue and uses speech-bubble tail cues. The authors release a new character bank (PopCharacters) and an extended test dataset (PopManga-X) to support robust evaluation of naming, tails, and text classification, as well as scalable transcription across thousands of manga chapters. Their semi-supervised training strategy and tail-aware architecture yield significant gains over baselines in clustering, diarisation, and text classification, with practical value for visually impaired readers and accessibility pipelines. Together, Magiv2, PopCharacters, and PopManga-X advance automatic, coherent, named chapter transcripts for manga, broadening access to narrative content.
Abstract
Enabling visually impaired individuals to engage with manga presents a significant challenge due to its inherently visual nature. With the goal of fostering accessibility, this paper aims to generate a dialogue transcript of a complete manga chapter, entirely automatically, with a particular emphasis on ensuring narrative consistency. This entails identifying (i) what is being said, i.e., detecting the texts on each page and classifying them as essential or non-essential, and (ii) who is saying it, i.e., attributing each dialogue to its speaker, while ensuring that the same characters are named consistently throughout the chapter. To this end, we introduce: (i) Magiv2, a model capable of generating high-quality chapter-wide manga transcripts with named characters and significantly higher precision in speaker diarisation than prior work; (ii) an extension of the PopManga evaluation dataset, which now includes annotations for speech-bubble tail boxes, associations of text to corresponding tails, classifications of text as essential or non-essential, and the identity of each character box; and (iii) a new character bank dataset, which comprises over 11K characters from 76 manga series, featuring 11.5K exemplar character images in total, as well as a list of chapters in which they appear. The code, trained model, and both datasets can be found at: https://github.com/ragavsachdeva/magi
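To make the two sub-tasks above concrete, the following is a minimal sketch of how per-text predictions might be assembled into a named transcript. It is not the Magiv2 API: the `TextBox` type, its fields, the `build_transcript` function, and the example character names are all hypothetical, and the sketch assumes the texts arrive already sorted in reading order with a chapter-wide speaker id attached.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TextBox:
    """Hypothetical record for one detected text region (not the Magiv2 schema)."""
    page: int
    text: str
    is_essential: bool          # dialogue to keep vs. non-essential text to drop
    speaker_id: Optional[int]   # assumed chapter-wide character id, None if unattributed


def build_transcript(texts: List[TextBox], names: Dict[int, str]) -> List[str]:
    """Keep essential texts and prefix each with a chapter-consistent name.

    Assumes `texts` is already in reading order. Speakers without a known
    name fall back to "Unknown".
    """
    lines = []
    for t in texts:
        if not t.is_essential:
            continue  # skip text classified as non-essential
        if t.speaker_id is None:
            name = "Unknown"
        else:
            name = names.get(t.speaker_id, "Unknown")
        lines.append(f"{name}: {t.text}")
    return lines


# Illustrative input with made-up names and dialogue.
example_texts = [
    TextBox(page=1, text="Let's go!", is_essential=True, speaker_id=0),
    TextBox(page=1, text="BOOM", is_essential=False, speaker_id=None),
    TextBox(page=2, text="Wait for me.", is_essential=True, speaker_id=1),
    TextBox(page=2, text="Who's there?", is_essential=True, speaker_id=7),
]
example_names = {0: "Aiko", 1: "Ren"}
transcript = build_transcript(example_texts, example_names)
```

Because the same id always maps to the same name, a character keeps one name across every page of the chapter, which is the consistency property the abstract emphasizes.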
