CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding

Emanuele Vivoli, Marco Bertini, Dimosthenis Karatzas

TL;DR

CoMix is a novel benchmark designed to evaluate the multi-task capabilities of models in comic analysis. It covers a broader range of tasks than existing benchmarks: object detection, speaker identification, character re-identification, reading order, and multi-modal reasoning tasks such as character naming and dialogue generation.

Abstract

The comic domain is rapidly advancing with the development of single-page analysis and synthesis models. However, evaluation metrics and datasets lag behind, often limited to small-scale or single-style test sets. We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis. Unlike existing benchmarks that focus on isolated tasks such as object detection or text recognition, CoMix addresses a broader range of tasks including object detection, speaker identification, character re-identification, reading order, and multi-modal reasoning tasks like character naming and dialogue generation. Our benchmark comprises three existing datasets with expanded annotations to support multi-task evaluation. To mitigate the over-representation of manga-style data, we have incorporated a new dataset of carefully selected American comic-style books, thereby enriching the diversity of comic styles. CoMix is designed to assess pre-trained models in zero-shot and limited fine-tuning settings, probing their transfer capabilities across different comic styles and tasks. The validation split of the benchmark is publicly available for research purposes, and an evaluation server for the held-out test split is also provided. Comparative results between human performance and state-of-the-art models reveal a significant performance gap, highlighting substantial opportunities for advancements in comic understanding. The dataset, baseline models, and code are accessible at https://github.com/emanuelevivoli/CoMix-dataset. This initiative sets a new standard for comprehensive comic analysis, providing the community with a common benchmark for evaluation on a large and varied set.

Paper Structure

This paper contains 23 sections, 10 figures, 14 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Composition of the CoMix benchmark. The top part of the figure provides a qualitative representation of the datasets included in CoMix. The accompanying bar charts depict the differences between the original annotations and those extended in CoMix. The left chart shows the increased number of annotations per dataset, whereas the right chart details the increase per task.
  • Figure 2: The CoMix benchmark contains 4 computational tasks (object detection, speaker identification, character re-identification, panel-text sorting) and 2 multi-modal reasoning tasks (character naming and dialog generation), which require models to detect objects and their relations as well as to read text. The figure shows the annotations added for each comic page; on the left is an example annotation for the multi-modal reasoning task of dialog generation.
  • Figure 3: Image from "DCM", original annotations (left) and our CoMix corrected and integrated annotations (right). Every point indicates a re-identified character.
  • Figure 4: Image from "eBDtheque", original annotations (left) and our CoMix corrected and integrated annotations (right).
  • Figure 5: Image from "PopManga", original annotations (left) and our CoMix corrected and integrated annotations (right).
  • ...and 5 more figures