MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa

TL;DR

This paper introduces MangaOCR and MangaVQA as two benchmarks to evaluate multimodal manga understanding, focusing on in-image text recognition and narrative-context visual question answering over two-page spreads. It also presents MangaLMM, a manga-specialized open-source model finetuned from Qwen2.5-VL to jointly handle OCR and VQA, trained with a consolidated OCR dataset and a large synthetic VQA corpus generated via GPT-4o. Across extensive experiments, MangaLMM demonstrates strong OCR performance and competitive VQA results, outperforming many open-source baselines and challenging proprietary models on these manga-specific tasks. The work provides open benchmarks, synthetic data, and a practical baseline to advance domain-specific multimodal understanding in manga and similar richly text-embedded visual media.

Abstract

Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.

Paper Structure

This paper contains 31 sections, 1 equation, 13 figures, and 15 tables.

Figures (13)

  • Figure 1: Overview of MangaVQA and MangaLMM. We present MangaVQA, a newly proposed benchmark for multimodal context understanding, consisting of 526 manually constructed question–answer pairs. We also develop MangaLMM, a manga-specialized model jointly trained to handle both MangaOCR and MangaVQA tasks.
  • Figure 2: Illustration of a two-page spread from the Manga109 dataset.
  • Figure 3: Distributions in MangaVQA. The dataset is structured along four key axes: (a) Required Information, (b) Answer Type, (c) 5W1H, and (d) Author Type.
  • Figure 4: Main categorization of MangaVQA: Answer type. MangaVQA consists of (1) Exact Extraction, where the answer is directly extracted from the image; and (2) Descriptive Answering, where the answer requires explanatory or contextual responses beyond simple word extraction.
  • Figure 5: Category-wise score breakdown. Compared to the original model (Qwen2.5-VL-7B-Instruct), our trained MangaLMM improves scores across every tag in every category.
  • ...and 8 more figures