The Manga Whisperer: Automatically Generating Transcriptions for Comics

Ragav Sachdeva, Andrew Zisserman

TL;DR

This work introduces Magi, a unified model that automatically diarises manga pages by detecting panels, text blocks, and characters, clustering character identities without a predefined number of clusters, and linking dialogues to speakers to produce a reading-order transcript. It frames the task as graph generation, separating detection/association from transcript synthesis and leveraging in-context cues from the full page. The authors develop a two-stage training regime on Mangadex-1.5M and PopManga, introduce a DAG-based panel ordering method, and build the PopManga and Mangadex-1.5M datasets with extensive annotations. Empirically, Magi achieves state-of-the-art performance across detection, clustering (notably AMI), and diarisation metrics, improving accessibility for visually impaired readers and enabling automated, order-respecting manga transcripts. Future work envisions combining this diarisation with large language models to incorporate dialogue history and higher-level understanding for richer descriptions.

Abstract

In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi.
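
As a rough illustration of DAG-based reading order, the sketch below builds a directed graph over panel boxes and topologically sorts it. The `precedes` rule (a panel fully above another comes first; within a row, the panel further right comes first) is a hypothetical stand-in for right-to-left manga reading and is not the rule used by Magi. The names `Box`, `precedes` and `reading_order` are likewise assumptions made for this example.

```python
from graphlib import TopologicalSorter

# (x1, y1, x2, y2) in pixels, origin at the top-left corner of the page.
Box = tuple[float, float, float, float]


def precedes(a: Box, b: Box) -> bool:
    """Hypothetical rule: a is read before b if a lies fully above b, or if
    the two share a row and a lies fully to the right of b."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if ay2 <= by1:  # a is entirely above b
        return True
    if by2 <= ay1:  # b is entirely above a
        return False
    return ax1 >= bx2  # rows overlap: right-hand panel is read first


def reading_order(panels: list[Box]) -> list[int]:
    """Return panel indices in reading order via a topological sort."""
    sorter = TopologicalSorter()
    for j, b in enumerate(panels):
        # Every panel that must be read before panel j is a predecessor of j.
        preds = [i for i, a in enumerate(panels) if i != j and precedes(a, b)]
        sorter.add(j, *preds)
    return list(sorter.static_order())


# A 2x2 grid given in scrambled order:
# bottom-left, top-right, bottom-right, top-left.
grid = [(0, 50, 50, 100), (50, 0, 100, 50), (50, 50, 100, 100), (0, 0, 50, 50)]
print(reading_order(grid))
```

This naive rule is only safe for regular grids. On irregular layouts it can produce cyclic constraints, in which case `static_order()` raises `graphlib.CycleError`, so a practical method needs a more careful edge definition.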

Paper Structure

This paper contains 31 sections, 1 equation, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Given a manga page, our model is able to: (a) detect panels, text blocks and character boxes; (b) cluster character boxes by their identity; (c) associate texts to their speaker; and (d) generate a dialogue transcription in the correct reading order. Here we show the predicted panels (in green), text blocks (in red) and characters (in blue) on a page from Hunter $\times$ Hunter by Yoshihiro Togashi. The predicted character identity associations are shown by lines joining the character box centres. For visual clarity, we do not explicitly show the text to speaker associations but provide the generated transcript.
  • Figure 2: The Magi Architecture: Given a manga page as input, our model predicts bounding boxes for panels, text blocks and characters, and associates the detected character-character and text-character pairs. The model ingests a high resolution manga page as input to a CNN backbone, followed by a transformer encoder-decoder resulting in $N\times$[OBJ] + [C2C] + [T2C] tokens. The [OBJ] tokens are processed by the detection heads (box and class) to obtain the bounding boxes and their classifications. The [OBJ] tokens corresponding to detected objects are then processed in pairs, along with [C2C] and [T2C], by a character matching module and a speaker association module respectively resulting in character clusters and diarisation.
  • Figure 3: Panel ordering: On the left are the ordering predictions from a prior method and on the right those from ours. Images: Prism Heart © Mai Asatsuki.
  • Figure 4: Bounding box predictions determined by the Magi model for characters, text blocks and panels, as well as clustering predictions (as nodes and edges). For the purposes of visualisation, we remove redundant connections for characters that are already connected via transitivity. Best viewed digitally. Notice that the model has successfully matched characters despite occlusion/partial visibility (girl: A4, hand: B1), changing viewpoints (boy: A5, girl: A2), and varying fidelity (girl: A1, boy: B2). The model can also detect non-human characters (dog-like creatures: A1, octopus-like creature: B3). We also show a failure case where the model incorrectly matches two different characters (one is wearing a checked scarf while the other is wearing a checked coat; they are brothers and look similar).
  • Figure 5: Text to speaker predictions generated by the Magi model. Each predicted text box is connected to a predicted character box using a line. The opacity of the line reflects the confidence of the model (the darker the line, the more confident the model is). Each predicted character box has a number at its centre based on the clustering predictions. We also show the final generated transcript. Note that all the dialogues are in the correct reading order. For text to speaker predictions that have a low confidence score ($<0.4$) we replace the predicted speaker with $\langle?\rangle$ in the generated transcript and let the reader infer it from context. Best viewed digitally.
  • ...and 7 more figures
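
The captions for Figures 4 and 5 describe two post-processing ideas that a small sketch can make concrete: grouping character boxes that are linked by pairwise matches (including links implied by transitivity), and replacing the speaker with $\langle?\rangle$ when the text-to-speaker confidence falls below 0.4. The code below is a minimal illustration under stated assumptions, not Magi's implementation. The union-find grouping, the `Utterance` type, and the functions `cluster_characters` and `build_transcript` are all hypothetical names introduced here, and the utterances are assumed to be already sorted in reading order.

```python
from dataclasses import dataclass


def cluster_characters(n: int, matches: list[tuple[int, int]]) -> list[int]:
    """Group n character boxes into identity clusters from pairwise matches.

    Uses union-find, so the number of clusters is not fixed in advance and
    boxes linked only through a chain of matches end up in the same cluster.
    """
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in matches:
        parent[find(a)] = find(b)

    labels: dict[int, int] = {}
    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


@dataclass
class Utterance:
    text: str
    speaker_box: int   # index of the predicted speaker's character box
    confidence: float  # text-to-speaker confidence in [0, 1]


def build_transcript(utterances: list[Utterance], cluster_of: list[int],
                     min_conf: float = 0.4) -> list[str]:
    """Format utterances (assumed to be in reading order) as transcript lines."""
    lines = []
    for u in utterances:
        # Low-confidence speakers are masked so the reader infers them from context.
        who = f"<{cluster_of[u.speaker_box]}>" if u.confidence >= min_conf else "<?>"
        lines.append(f"{who}: {u.text}")
    return lines


clusters = cluster_characters(5, [(0, 2), (2, 4)])
page = [Utterance("Who goes there?", 2, 0.9), Utterance("...", 1, 0.2)]
print(clusters)
print(build_transcript(page, clusters))
```

Boxes 0, 2 and 4 share one identity here even though 0 and 4 were never matched directly, which mirrors the transitivity noted in the Figure 4 caption.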