Can Vision-Language Models Infer Speaker's Ignorance? The Role of Visual and Linguistic Cues

Ye-eun Cho, Yunho Maeng

TL;DR

This work probes whether vision-language models (VLMs) can draw pragmatic inferences about ignorance implicatures by jointly manipulating visual context and linguistic prompts. Across two experiments with three state-of-the-art VLMs, most models rely primarily on the lexical meaning of numeral modifiers and make little use of visual cues, though Claude integrates cues better when multiple contextual cues are available. The findings suggest that Claude combines cues in a threshold-like, nonlinear fashion that aligns more closely with human pragmatic reasoning, while GPT-4o and Gemini tend toward literal interpretations. These results highlight cue combination as a potential marker of emergent pragmatic competence in multimodal systems and point to directions for developing VLMs with more robust context-sensitive inference.

Abstract

This study investigates whether vision-language models (VLMs) can perform pragmatic inference, focusing on ignorance implicatures, in which an utterance implies that the speaker lacks precise knowledge. To test this, we systematically manipulated two contextual cues: the visually depicted situation (visual cue) and linguistic prompts based on the Question Under Discussion (QUD; linguistic cue). When only visual cues were provided, three state-of-the-art VLMs (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) produced interpretations based largely on the lexical meaning of the modified numerals. When linguistic cues were added to make the context more informative, Claude exhibited more human-like inference by integrating both types of contextual cues. In contrast, GPT-4o and Gemini favored precise, literal interpretations. Although contextual cues had a stronger influence on these two models than in the visual-only setting, the models treated each cue independently and aligned it with semantic features rather than engaging in context-driven reasoning. These findings suggest that, although the models differ in how they handle contextual cues, Claude's ability to combine multiple cues may signal emerging pragmatic competence in multimodal models.

Paper Structure

This paper contains 14 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of the experimental procedure
  • Figure 2: Results of Experiment 1: mean appropriateness scores for image-text pairs by modifier, situation, and model
  • Figure 3: Results of Experiment 2: mean appropriateness scores for image-text pairs by modifier, situation, and model, across QUDs
  • Figure 4: Modeling a threshold effect via linear and nonlinear cue combination as a function of the number of contextual cues (adapted from Parker, 2019)