Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment

Christopher Driggers-Ellis; Nachiketh Tibrewal; Rohit Bogulla; Harsh Khanna; Sangpil Youm; Christan Grant; Bonnie Dorr

TL;DR

This work identifies and categorizes the hallucinations that emerge when vision-language models interpret comic pages, organizes them into generalized object-hallucination taxonomies, and concludes with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.

Abstract

A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.

Paper Structure (19 sections, 4 equations, 2 figures, 3 tables)

Introduction
Related Work
Prior Studies and Datasets for Comic Interpretation
Hallucination Taxonomies in Comic Interpretation
Methodology
Dataset
Experiment
Prompt:
Hallucination Study
Results
Semantic Similarity Analysis
Hallucination Frequency Analysis
Discussion
Benchmarking Experiment
Hallucination Frequency Discussion
...and 4 more sections

Figures (2)

Figure 1: An image from our benchmarking corpus (left) paired with its ground truth interpretation (right).
Figure 2: The responses each VLM gave in our benchmarking experiment for the example page. Instances of Quoting Text that is Not There (Red), Misattributing Dialogue/Narration (Blue), Incorrect Object/Scenery Description (Green), and Incorrect Character Description (Orange) are highlighted. For readability, the highlighting is not exhaustive. The Cosine Similarity (Cos) and KL Divergence (KL) scores assigned to each response appear in its header.
