Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs

Yilun Hua; Yoav Artzi

TL;DR

This work asks whether multimodal LLMs spontaneously develop ad-hoc conventions for efficient communication during in-context conversations. It introduces ICCA, an automated framework that uses human-human reference-game data to quantify in-context adaptation in MLLMs by measuring utterance length, lexical convergence, and accuracy, with $WNR$ serving as a sensitive metric of lexical change. Across five state-of-the-art MLLMs, the results show no spontaneous convention formation: only heavy prompting induces some lexical efficiency, and only in a subset of models, while stability and convergence remain poor. The findings reveal a gap between human conversational adaptation and what current training and instruction tuning produce. ICCA offers a scalable, automated platform for ongoing evaluation and future improvement of in-context adaptation in multimodal models.

Abstract

Humans spontaneously use increasingly efficient language as interactions progress, by adapting and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showing properties of human language that go beyond relaying intents. It remains unexplored whether multimodal large language models (MLLMs) similarly increase communication efficiency during interactions, and what mechanisms they may adopt for this purpose. We introduce ICCA, an automated framework to evaluate such conversational adaptation as an in-context behavior in MLLMs. We evaluate several state-of-the-art MLLMs, and observe that while they may understand the increasingly efficient language of their interlocutor, they do not spontaneously make their own language more efficient over time. This latter ability can only be elicited in some models (e.g., GPT-4) with heavy-handed prompting. This shows that this property of linguistic interaction does not arise from current training regimes, even though it is a common hallmark of human language. ICCA is available at https://github.com/lil-lab/ICCA.

Paper Structure (21 sections, 11 figures)

Introduction
Background and Related Work
Repeated Reference Games
Ad-hoc Adaptation in Interactions
Model Adaptation
The ICCA Framework
Model-as-speaker Experiments
Model-as-listener Experiments
History and Context Impact
Discussion
Tendency to Repeat Messages
Lexical Efficiency $\neq$ Communication Efficiency
Performance Degradation with Many-image Inputs
Conclusion
Implementation Details
...and 6 more sections

Figures (11)

Figure 1: Illustration of a reference game. The speaker (blue) and listener (orange) observe a shared set of images. The interaction progresses over six repetitions, each of which includes a trial for every context image. In each trial, the speaker describes a target image, and the listener has to select the correct target given only the description. For simplicity, this figure omits the feedback on listener actions. This interaction illustrates some of the effects of convention formation: the descriptions become shorter as the interaction progresses, and lexical choices converge to a subset of the words used in earlier repetitions.
Figure 2: Speaker experiments. Margins of error are bootstrapped 95% CIs.
Figure 3: Listener experiments. Margins of error are bootstrapped 95% CIs.
Figure 4: GloVe embedding similarity and WNR between messages from consecutive repetitions. Every increase in GloVe embedding similarity is matched by a corresponding decrease in WNR, and vice versa. Margins of error are bootstrapped 95% CIs.
Figure 5: Word Novelty Distance for speaker experiments. Margins of error are bootstrapped 95% CIs.
...and 6 more figures
