MangaUB: A Manga Understanding Benchmark for Large Multimodal Models

Hikaru Ikuta; Leslie Wöhler; Kiyoharu Aizawa

TL;DR

MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as conveyed across multiple panels, allowing for a fine-grained analysis of a model’s various capabilities required for manga understanding.

Abstract

Manga is a popular medium that combines stylized drawings and text to convey stories. As manga panels differ from natural images, computational systems traditionally had to be designed specifically for manga. Recently, the adaptive nature of modern large multimodal models (LMMs) has opened up possibilities for more general approaches. To analyze the current capability of LMMs on manga understanding tasks and to identify areas for improvement, we design and evaluate MangaUB, a novel manga understanding benchmark for LMMs. MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as conveyed across multiple panels, allowing for a fine-grained analysis of a model's various capabilities required for manga understanding. Our results show strong performance on the recognition of image content, while understanding the emotion and information conveyed across multiple panels remains challenging, highlighting directions for future work on LMMs for manga understanding.

Paper Structure (21 sections, 2 figures, 5 tables)

Introduction
Related Work
Computational Analysis of Manga
LMM Benchmarks
Benchmark Design Overview
The Input Design of the Multi-Panel Tasks
Evaluation Methods
Models
Task Definition Details
Single-Panel Scene Recognition Tasks
Single-Panel Scene Understanding Tasks
Multi-Panel Recognition Task
Multi-Panel Understanding Task
Results and Discussions
Character Count
...and 6 more sections

Figures (2)

Figure 1: A summary of the tasks in the benchmark. Our tasks focus on recognition and understanding of image content for single-panel and multi-panel input. All manga panel images in this figure are from the Manga109 dataset (Matsui et al., 2017), courtesy of the authors of each manga indicated in the figure.
Figure 2: Performance of various open-source and proprietary models. The evaluation shows strong performance on the recognition of image content, while understanding the emotion and information conveyed across multiple panels remains challenging. The abbreviations used in the labels are described in Table \ref{tab:prompt-count-stats}.
