Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names

Ragav Sachdeva, Gyungin Shin, Andrew Zisserman

TL;DR

This work introduces Magiv2, a chapter-wide manga transcription system that produces transcripts with consistently named characters and improved speaker attribution, addressing key shortcomings of prior page-level approaches. It combines a graph-based detection and association framework, a constraint-optimisation approach to chapter-wide character naming, and a four-step transcript generation process that keeps essential dialogue and uses speech-bubble tail cues. The authors release a new character bank (PopCharacters) and an extended test dataset (PopManga-X) for evaluating naming, tail detection, and text classification, and to support scalable transcription across thousands of manga chapters. Their semi-supervised training strategy and tail-aware architecture yield significant gains over baselines in clustering, diarisation, and text classification, with practical benefit for visually impaired readers and accessibility pipelines. Together, these contributions advance automatic, coherent, named chapter transcripts for manga.

Abstract

Enabling engagement with manga by visually impaired individuals presents a significant challenge due to its inherently visual nature. With the goal of fostering accessibility, this paper aims to generate a dialogue transcript of a complete manga chapter, entirely automatically, with a particular emphasis on ensuring narrative consistency. This entails identifying (i) what is being said, i.e., detecting the texts on each page and classifying them into essential vs non-essential, and (ii) who is saying it, i.e., attributing each dialogue to its speaker, while ensuring the same characters are named consistently throughout the chapter. To this end, we introduce: (i) Magiv2, a model that is capable of generating high-quality chapter-wide manga transcripts with named characters and significantly higher precision in speaker diarisation over prior works; (ii) an extension of the PopManga evaluation dataset, which now includes annotations for speech-bubble tail boxes, associations of text to corresponding tails, classifications of text as essential or non-essential, and the identity for each character box; and (iii) a new character bank dataset, which comprises over 11K characters from 76 manga series, featuring 11.5K exemplar character images in total, as well as a list of chapters in which they appear. The code, trained model, and both datasets can be found at: https://github.com/ragavsachdeva/magi

Paper Structure

This paper contains 37 sections, 4 equations, 16 figures, 6 tables, and 1 algorithm.

Figures (16)

  • Figure 1: (Left) Magi [Sachdeva24] generates a page-level transcript, with non-essential texts and without character names. (Right) Magiv2 (ours) generates chapter-wide transcripts with principal characters consistently named across pages, higher precision for speaker diarisation and only dialogue-essential texts.
  • Figure 1: Web-scraping from Fandom. For a given series, available on Fandom, we can often scrape the list of chapters, principal characters, character appearances, and thumbnail images for the characters.
  • Figure 2: Inference pipeline. Given a manga chapter, along with a character bank: (1) each page is processed independently to detect various elements and their relationships, such as character and text boxes, and their association. Next, (2) using the character bank, names are assigned to detected character crops across all pages using a constraint optimisation approach. Finally, (3) the transcript is generated by performing OCR, ordering all the texts and removing non-essential texts.
  • Figure 2: Proportion of chapters for each character for two series: One Piece (a) and Vagabond (b). The proportion of chapters (y-axis) is defined as the number of chapters in which a character appears divided by the total number of chapters.
  • Figure 3: Simplified Detection and Association Architecture. The input to the model is an RGB image of a manga page. The transformer decoder outputs several feature vectors which are used to predict bounding boxes for characters, texts, panels and tails ("nodes"). These features are further processed in pairs to predict character-character, text-character and text-tail associations ("edges").
  • ...and 11 more figures
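
Step (2) of the inference pipeline in Figure 2, assigning names from a character bank to detected character crops under constraints, can be illustrated with a small sketch. This is not the authors' formulation: it swaps in a simple greedy matching on cosine similarity. The `Crop` class, the `assign_names` function, the 0.5 threshold, the "Other" fallback label, and the rule that a name may be used at most once per panel are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List


@dataclass
class Crop:
    """A detected character box (hypothetical structure)."""
    page: int
    panel: int
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign_names(crops: List[Crop], bank: Dict[str, List[float]],
                 threshold: float = 0.5) -> List[str]:
    """Greedily name crops, best match first.

    Assumed constraint: the same name cannot be given to two crops
    in the same panel. Crops without a confident match stay "Other".
    """
    scored = sorted(
        ((cosine(c.embedding, emb), i, name)
         for i, c in enumerate(crops)
         for name, emb in bank.items()),
        reverse=True,
    )
    names = ["Other"] * len(crops)
    assigned = set()   # indices of crops that already have a name
    used = set()       # (page, panel, name) triples already taken
    for score, i, name in scored:
        if score < threshold:
            break  # all remaining pairs are weaker than the threshold
        key = (crops[i].page, crops[i].panel, name)
        if i in assigned or key in used:
            continue
        names[i] = name
        assigned.add(i)
        used.add(key)
    return names


# Toy character bank and detections (made-up 2-D embeddings).
bank = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}
crops = [
    Crop(page=0, panel=0, embedding=[0.9, 0.1]),
    Crop(page=0, panel=0, embedding=[0.8, 0.3]),
    Crop(page=1, panel=0, embedding=[0.1, 0.95]),
    Crop(page=1, panel=1, embedding=[-1.0, -1.0]),
]
print(assign_names(crops, bank))
```

A greedy pass keeps the sketch short and easy to follow. A global solver would instead weigh all crops in the chapter jointly, which is closer in spirit to the constraint optimisation approach the caption describes.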