Multimodal Transformer for Comics Text-Cloze

Emanuele Vivoli, Joan Lafuente Baeza, Ernest Valveny Llobet, Dimosthenis Karatzas

TL;DR

This work tackles the comics text-cloze problem by introducing a Multimodal Large Language Model (Multimodal-LLM) that fuses visual panel content with balloon text. The approach is a vision–text pipeline that combines a domain-adapted, self-supervised ResNet-50 encoder and interchangeable region extractors (Faster R-CNN or SAM) with VL-T5, supporting both classification and generative text-cloze. Key contributions include a 10% accuracy gain over prior methods on the easy and hard variants, new OCR annotations (Textract) that further improve performance, and an extension of the task to a generative format. The work also provides ablations on OCR and panel representations, and releases code and OCR data to support research in comics analysis and multimodal reasoning.

Abstract

This work explores a closure task in comics, a medium where visual and textual elements are intricately intertwined. Specifically, Text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighboring panels. Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations. We introduce a novel Multimodal Large Language Model (Multimodal-LLM) architecture, specifically designed for Text-cloze, achieving a 10% improvement over existing state-of-the-art models in both its easy and hard variants. Central to our approach is a Domain-Adapted ResNet-50 based visual encoder, fine-tuned to the comics domain in a self-supervised manner using SimCLR. This encoder delivers comparable results to more complex models with just one-fifth of the parameters. Additionally, we release new OCR annotations for this dataset, enhancing model input quality and resulting in another 1% improvement. Finally, we extend the task to a generative format, establishing new baselines and expanding the research possibilities in the field of comics analysis.
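
To make the classification form of the task concrete, here is a minimal sketch of a text-cloze instance and its evaluation, assuming three context panels and three candidates as in Figure 1. The class and field names, and the `word_overlap` scorer, are illustrative assumptions rather than the paper's data format or model; the sketch is text-only, whereas the proposed model also consumes panel images.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ClozeInstance:
    """One text-cloze example (field names are illustrative, not the dataset's)."""
    context_texts: List[str]  # OCR text of the context panels
    candidates: List[str]     # candidate texts for the masked balloon
    answer_index: int         # index of the correct candidate


Scorer = Callable[[Sequence[str], str], float]


def predict(instance: ClozeInstance, score: Scorer) -> int:
    """Return the index of the highest-scoring candidate."""
    scores = [score(instance.context_texts, c) for c in instance.candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(instances: Sequence[ClozeInstance], score: Scorer) -> float:
    """Fraction of instances where the top-scoring candidate is the correct one."""
    correct = sum(predict(x, score) == x.answer_index for x in instances)
    return correct / len(instances)


def word_overlap(context: Sequence[str], candidate: str) -> float:
    """Toy stand-in scorer: number of candidate words that also occur in the context."""
    context_words = {w.lower() for text in context for w in text.split()}
    return float(sum(w.lower() in context_words for w in candidate.split()))


example = ClozeInstance(
    context_texts=["Where is the map?", "I hid the map in the cave.", "Then we go to the cave!"],
    candidates=["Nice weather today.", "Lead us to the cave and the map.", "I like pie."],
    answer_index=1,
)
print(predict(example, word_overlap), accuracy([example], word_overlap))
```

Any model that assigns a score to a (context, candidate) pair can be plugged in as `score`; the toy scorer only serves to make the example runnable.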

Paper Structure (40 sections, 2 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Two instances of the text-cloze task: a question panel (with masked balloons) is provided together with three candidate answers. Three context panels are provided to help choose the correct answer.
  • Figure 1: Reduced dimensional distribution of test objects via t-SNE on Comics ResNet, featuring K-means detected clusters.
  • Figure 2: Architecture of VL-T5 with text pipeline (OCR system extractor and VL-T5 embeddings), and image pipeline (custom extractors and image encoders).
  • Figure 2: Representative objects from each identified cluster in the t-SNE plot, maintaining consistent cluster labeling.
  • Figure 3: Representation of the textbox (balloon) separator tokens. Each group of tokens belonging to a textbox (balloon) is preceded by a two-token prefix that encodes the context (light green) and the textbox id (colored tokens). The orange tokens are the balloon text tokens.
  • ...and 3 more figures
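
The separator-token layout described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the token strings `<ctx_i>` and `<box_j>` are hypothetical names, the context token is assumed to identify the panel a balloon belongs to, and whitespace splitting stands in for the model's actual tokenizer.

```python
from typing import List, Sequence


def build_sequence(panels: Sequence[Sequence[str]]) -> List[str]:
    """Flatten per-panel balloon texts into one token list.

    Each balloon's text tokens are preceded by a two-token prefix:
    a context token and a textbox-id token (both names are hypothetical).
    """
    tokens: List[str] = []
    for panel_idx, balloons in enumerate(panels):
        for box_idx, text in enumerate(balloons):
            tokens.append(f"<ctx_{panel_idx}>")  # hypothetical context token
            tokens.append(f"<box_{box_idx}>")    # hypothetical textbox-id token
            tokens.extend(text.split())          # stand-in for real tokenization
    return tokens


print(build_sequence([["hi there"], ["a", "b c"]]))
```

Keeping the prefix per balloon, rather than per panel, lets each group of text tokens be traced back to both its panel and its textbox.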