ComicsPAP: understanding comic strips by picking the correct panel

Emanuele Vivoli, Artemis Llabrés, Mohamed Ali Souibgui, Marco Bertini, Ernest Valveny Llobet, Dimosthenis Karatzas

TL;DR

ComicsPAP addresses a gap in multimodal understanding: comic strips, where implicit panel boundaries and fragmented narration challenge current models. It introduces a large-scale dataset (over 100k samples) with five tasks under the Pick-a-Panel paradigm, which evaluate narrative anticipation, multi-frame reasoning, and co-reference through panel selection. Zero-shot evaluations show that state-of-the-art LMMs perform near chance. This motivates targeted fine-tuning with LoRA on a 100k training set, after which smaller 3B and 7B models achieve substantial gains, often surpassing much larger models. The dataset and fine-tuned models are publicly available, establishing ComicsPAP as a robust resource to advance research in multimodal comic comprehension and cross-panel reasoning.

Abstract

Large multimodal models (LMMs) have made impressive strides in image captioning, VQA, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. Comprising over 100k samples and organized into 5 subtasks under a Pick-a-Panel framework, ComicsPAP requires models to identify the missing panel in a sequence. Our evaluations, conducted under both multi-image and single-image protocols, reveal that current state-of-the-art LMMs perform near chance on these tasks, underscoring significant limitations in capturing sequential and contextual dependencies. To close the gap, we adapt LMMs for comic strip understanding and obtain better results on ComicsPAP than models 10x larger, demonstrating that ComicsPAP offers a robust resource to drive future research in multimodal comic comprehension.
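To make the "near chance" comparison concrete, the following is a minimal sketch of how accuracy on a Pick-a-Panel style task could be compared with uniform random guessing. The `PickAPanelSample` container, its field names, and both helper functions are hypothetical and are not taken from the released dataset or code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PickAPanelSample:
    """Hypothetical container for one sample (field names are assumptions)."""
    context: List[str]   # panel identifiers in reading order, one slot masked
    options: List[str]   # candidate panels for the masked slot
    answer_index: int    # index into `options` of the correct panel


def accuracy(samples: Sequence[PickAPanelSample],
             predictions: Sequence[int]) -> float:
    """Fraction of samples where the predicted option index is correct."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    if not samples:
        return 0.0
    correct = sum(int(p == s.answer_index) for s, p in zip(samples, predictions))
    return correct / len(samples)


def chance_accuracy(samples: Sequence[PickAPanelSample]) -> float:
    """Expected accuracy of uniform random guessing: the mean of 1 / #options."""
    if not samples:
        return 0.0
    return sum(1.0 / len(s.options) for s in samples) / len(samples)
```

A model is "near chance" in this sense when `accuracy(...)` is close to `chance_accuracy(...)` on the same samples.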

Paper Structure

This paper contains 17 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: ComicsPAP dataset composition (left) and the performance of our comic-adapted LMMs vs. zero-shot LMMs (right).
  • Figure 2: Overview of the five tasks in ComicsPAP. In Caption Relevance, the image is shown only for illustrative purposes: the real task doesn't provide the image.
  • Figure 3: Overview of validation and test annotations.
  • Figure 4: Overview of automatic task creation from the manually annotated story: the story pages (top); panel detection and ordering (middle); and the $N=5$ context and $M=6$ randomly sampled options (bottom).
  • Figure 5: Single-image example for the sequence filling task. All panels are reshaped into square images; panels are arranged to minimize wasted space; and option numbers are shown at the bottom of the panels.
  • ...and 1 more figures
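
The Figure 4 caption describes building a task from ordered panels with $N=5$ context panels and $M=6$ randomly sampled options. The sketch below illustrates one way such a sample could be assembled. It is not the authors' pipeline: the function name, the `<MASK>` placeholder, the choice to mask one panel inside the context window, and the choice to count the correct panel among the $M$ options are all assumptions, and panel identifiers are assumed to be unique.

```python
import random
from typing import Dict, List, Optional

MASK = "<MASK>"  # hypothetical placeholder for the missing panel


def make_sequence_filling_sample(
    panels: List[str],
    n_context: int = 5,
    m_options: int = 6,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Build one sample from panel identifiers given in reading order.

    Assumptions (not confirmed by the source): the context is a window of
    `n_context` consecutive panels with one of them masked, and the
    `m_options` candidates are the masked panel plus distractors drawn
    from panels outside that window.
    """
    rng = rng or random.Random()
    if len(panels) < n_context + m_options - 1:
        raise ValueError("not enough panels for the requested sample")

    # Pick a window of consecutive panels and one slot inside it to mask.
    start = rng.randrange(len(panels) - n_context + 1)
    window = panels[start:start + n_context]
    masked_pos = rng.randrange(n_context)
    answer = window[masked_pos]

    # Distractors come only from panels outside the context window.
    outside = panels[:start] + panels[start + n_context:]
    options = rng.sample(outside, m_options - 1) + [answer]
    rng.shuffle(options)

    context = list(window)
    context[masked_pos] = MASK
    return {
        "context": context,
        "options": options,
        "answer_index": options.index(answer),
    }
```

Passing a seeded `random.Random` makes the sampled window and distractors reproducible, which helps when regenerating a fixed evaluation split.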