The Manga Whisperer: Automatically Generating Transcriptions for Comics
Ragav Sachdeva, Andrew Zisserman
TL;DR
This work introduces Magi, a unified model that automatically diarises manga pages by detecting panels, text blocks, and characters, clustering identities without a predefined count, and linking dialogues to speakers to produce a reading-order transcript. It frames the task as graph generation, separating detection/association from transcript synthesis and leveraging in-context cues from the full page. The authors develop a two-stage training regime on Mangadex-1.5M and PopManga, introduce a DAG-based panel ordering method, and build the PopManga and Mangadex-1.5M datasets with extensive annotations. Empirically, Magi achieves state-of-the-art performance across detection, clustering (notably AMI), and diarisation metrics, substantially improving accessibility for visually impaired readers and enabling automated, order-respecting manga transcripts. Future work envisions combining this diarisation with large language models to incorporate dialogue history and higher-level understanding for richer descriptions.
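To make the DAG-based panel ordering concrete, here is a minimal sketch, not the authors' implementation. It assumes panels are axis-aligned boxes `(x1, y1, x2, y2)` with the origin at the top-left, and that manga is read right-to-left, top-to-bottom. The precedence rule in `precedes`, the tie-breaking key, and the fallback for any cyclic leftovers are all illustrative assumptions; `order_panels` is a hypothetical name.

```python
import heapq


def overlap(a_lo, a_hi, b_lo, b_hi):
    """True if the two 1-D intervals share more than a boundary point."""
    return min(a_hi, b_hi) - max(a_lo, b_lo) > 0


def precedes(a, b):
    """Assumed rule: panel a is read before panel b if a lies entirely
    above b (with horizontal overlap) or entirely to the right of b
    (with vertical overlap)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    above = ay2 <= by1 and overlap(ax1, ax2, bx1, bx2)
    right_of = ax1 >= bx2 and overlap(ay1, ay2, by1, by2)
    return above or right_of


def order_panels(panels):
    """Return panel indices in reading order via a topological sort
    (Kahn's algorithm) over the precedence graph."""
    n = len(panels)
    successors = [[] for _ in range(n)]
    in_degree = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and precedes(panels[i], panels[j]):
                successors[i].append(j)
                in_degree[j] += 1

    # Among panels that are ready, prefer the topmost, then the rightmost.
    def key(i):
        return (panels[i][1], -panels[i][2], i)

    ready = [key(i) for i in range(n) if in_degree[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        *_, i = heapq.heappop(ready)
        order.append(i)
        for j in successors[i]:
            in_degree[j] -= 1
            if in_degree[j] == 0:
                heapq.heappush(ready, key(j))

    # Assumption: if the rule ever yields a cycle, append the remaining
    # panels by the same key rather than failing.
    placed = set(order)
    order += sorted((i for i in range(n) if i not in placed), key=key)
    return order


# A 2x2 grid: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
grid = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20), (10, 10, 20, 20)]
print(order_panels(grid))
```

For the 2x2 grid, the sketch starts at the top-right panel and finishes at the bottom-left one, matching the usual manga reading direction.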
Abstract
In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. Specifically, we tackle the problem of diarisation, i.e., generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues with their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi.
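As an illustration of how contributions (b) and (c) could feed a transcript, the sketch below clusters characters without a preset cluster count and then emits one line per text box. It is a simplified assumption, not the Magi implementation: it takes pairwise "same identity" decisions and text-to-speaker links as given inputs, groups characters by connected components (union-find), and assumes the texts are already in reading order. The names `cluster_characters` and `build_transcript` are hypothetical.

```python
def cluster_characters(num_chars, same_identity_pairs):
    """Group character boxes into identities via connected components.
    The number of clusters is not specified; it falls out of the pairs."""
    parent = list(range(num_chars))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_identity_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    root_to_label = {}
    return [root_to_label.setdefault(find(i), len(root_to_label))
            for i in range(num_chars)]


def build_transcript(texts, text_to_char, labels):
    """Emit '<speaker>: <text>' lines; texts are assumed to be in reading
    order, and text_to_char maps a text index to a character-box index."""
    lines = []
    for idx, text in enumerate(texts):
        char = text_to_char.get(idx)
        speaker = "Unknown" if char is None else f"Character {labels[char]}"
        lines.append(f"{speaker}: {text}")
    return lines


# Four character boxes; boxes 0 and 2 are judged to be the same person.
labels = cluster_characters(4, [(0, 2)])
for line in build_transcript(["Hello", "Who?", "Bye"], {0: 2, 2: 3}, labels):
    print(line)
```

Text boxes with no linked character are kept and marked `Unknown`, so the transcript preserves every line of dialogue even when speaker association is missing.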
