Multimodal Shannon Game with Images

Vilém Zouhar, Sunit Bhattacharya, Ondřej Bojar

TL;DR

The paper extends the Shannon Game by adding an optional image modality to next-word prediction, formalized as predicting $p(w_i \mid w_{<i}, C)$, where the context $C$ includes image cues. It introduces the Multimodal Shannon Game (MMSG) and evaluates both human participants and GPT-2 under text-only and multimodal conditions across 17 sentences. Results show that image information improves confidence and accuracy for both humans and GPT-2, with effects that differ by part of speech (POS) and priming that grows stronger as the context grows. The work connects priming to prompting in LMs and indicates that multimodal information can enhance language understanding and modeling.

Abstract

The Shannon game has long been used as a thought experiment in linguistics and NLP, asking participants to guess the next letter in a sentence based on its preceding context. We extend the game by introducing an optional extra modality in the form of image information. To investigate the impact of multimodal information in this game, we use human participants and a language model (LM, GPT-2). We show that the addition of image information improves both self-reported confidence and accuracy for both humans and the LM. Certain word classes, such as nouns and determiners, benefit more from the additional modality information. The priming effect in both humans and the LM becomes more apparent as the context size (extra modality information + sentence context) increases. These findings highlight the potential of multimodal information in improving language understanding and modeling.

Paper Structure (17 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: The sentence "Several plates of food are set on a table." presented with an image. Given the first 3 words, the participant has to think of the next word, rate their confidence and, after "food" is revealed, self-evaluate how close they were.
  • Figure 2: The 4 configurations of multimodality for the same sentence ("Several plates of food are set on a table."). Given the first 3 words, the participant has to think of the next word, rate their confidence and, after "food" is revealed, self-evaluate how accurate they were. The "no image" configuration is not shown.
  • Figure 3: Annotation pipeline for Multimodal Shannon Game with images. The loop ends when the end of sentence is reached.
  • Figure 4: Prediction and self-evaluation of a single word based on the information of "To be," and the image.
  • Figure 5: Heatmap of confidence $\times$ self-eval scores across configurations. The x-axis is the confidence score. Each cell reports the relative number of such judgements. Correlations ($\rho$) are Pearson's correlation coefficients between the confidence and self-eval scores.
  • ...and 4 more figures
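
The two quantities named in the Figure 5 caption, the relative number of judgements per heatmap cell and Pearson's $\rho$ between confidence and self-eval scores, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' analysis code: the function names `relative_counts` and `pearson` are hypothetical, and the integer ratings in the usage note are made-up placeholders, since the rating scale is not given in this summary.

```python
import math
from collections import Counter


def relative_counts(pairs):
    """Map each (confidence, self_eval) cell to its share of all judgements."""
    if not pairs:
        raise ValueError("need at least one judgement")
    counts = Counter(pairs)
    total = len(pairs)
    return {cell: n / total for cell, n in counts.items()}


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred products and centred squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

As a usage example with placeholder ratings, `relative_counts([(1, 1), (1, 1), (2, 3), (5, 5)])` gives the cell `(1, 1)` a share of one half, and `pearson` applied to the confidence and self-eval columns of the same judgements yields the $\rho$ that would annotate one heatmap panel.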