Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!

Jiwan Chung, Seungwon Lim, Jaehyun Jeon, Seungbeen Lee, Youngjae Yu

TL;DR

Understanding Pun with Image Explanations (UNPIE) is a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Results on UNPIE indicate that various Socratic Models and Visual-Language Models improve over text-only models when given visual context, particularly as the complexity of the tasks increases.

Abstract

Humans possess multimodal literacy, allowing them to actively integrate information from various modalities when reasoning. When faced with challenges such as lexical ambiguity in text, we supplement the text with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability? In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy: Pun Grounding, Disambiguation, and Reconstruction. The results indicate that various Socratic Models and Visual-Language Models improve over text-only models when given visual context, particularly as the complexity of the tasks increases.

Paper Structure (21 sections, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Examples of puns accompanied by visual explanations from the r/puns subreddit on Reddit. Puns naturally occur with images to enhance understanding (zenner2018one), making them natural candidates for testing the active multimodal understanding capacity of machines.
  • Figure 2: The UNPIE benchmark comprises three multimodal tasks: 1. Identifying the specific phrase in an English sentence that constitutes a pun, using the provided (a) pun explanation image; 2. Choosing the translation of the pun sentence that aligns more closely with the given (b) pun disambiguator image; and 3. Reconstructing the English pun sentence from its translated version, aided by the corresponding (a) pun explanation image.
  • Figure 3: Comparison of homographic (left) and heterographic (right) puns in the UNPIE dataset, along with their respective disambiguator visual annotations.
  • Figure 4: An example of our pun explanation image generation process. A human worker interacts with an off-the-shelf text-to-image model, iteratively guiding the model to produce an image that satisfies each specified criterion.
  • Figure 5: A screenshot of the human annotation interface for pun-aware text translation.
  • ...and 2 more figures