Success and Cost Elicit Convention Formation for Efficient Communication

Saujas Vaduguru, Yilun Hua, Yoav Artzi, Daniel Fried

TL;DR

The paper addresses how to enable large multimodal models to form ad hoc linguistic conventions that make communication more efficient over time. It introduces a framework in which speaker and listener models play simulated repeated reference games, and it uses IPO, a preference-optimization method, to optimize a joint success+cost utility without relying on human-produced data. Empirical results show that models trained with success+cost gradually reduce utterance length while increasing communicative success, and that human listeners respond faster to these models, indicating effective convention formation. The work shows that both success and cost are necessary to elicit stable conventions: open-class word usage increases over time, and conventions form in both the COCO and tangram domains, suggesting broad applicability to interactive AI systems and human-AI communication settings.

Abstract

Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient; both are necessary to elicit convention formation.

Paper Structure

This paper contains 38 sections, 2 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: A real example interaction with our model playing a repeated reference game. In each turn, the speaker model has to describe a single image in the context of a set of images and previous interactions. A listener tries to guess the image being described. We train speaker models that adapt to communicate efficiently (using fewer words) with people by forming ad hoc conventions.
  • Figure 2: ① We simulate interactions between speaker and listener models, sampling multiple descriptions for a target image at the same stage of the interaction. ② We create preference pairs based on the communicative utility of descriptions. ③ We train the speaker with preference optimization using the pairs created from simulated games.
  • Figure 3: The success+cost speaker communicates increasingly successfully and efficiently over the course of the interaction with human listeners. The model achieves increasing accuracy while decreasing message length. This efficiency is also reflected in people responding faster to utterances produced by the success+cost speaker compared to other models. Error bars show standard error.
  • Figure 4: Examples of adaptation over the course of a game by the success+cost speaker model. In the COCO example, the model converges to a convention that doesn't reference 'kites', since that word doesn't distinguish the two images. In the tangrams example, the model uses a detailed description ("with triangle tail") in an initial turn, but drops the detail in later turns without losing success.
  • Figure 5: success+cost speakers achieve low and decreasing word novelty rate (WNR). Along with decreasing message length, this suggests convention formation. success models also achieve low WNR, but they repeat entire messages without growing more efficient. Error bars show standard error.
  • ...and 7 more figures
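
Step ② of Figure 2, creating preference pairs from the communicative utility of sampled descriptions, can be illustrated with a minimal sketch. The summary above does not give the utility's exact form, so the code assumes a simple one: a reward of 1 for listener success minus a per-word cost. The names `Sample`, `utility`, `preference_pairs`, `cost_weight`, and `margin` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Sample:
    text: str      # description sampled from the speaker for one target image
    success: bool  # whether the simulated listener picked the target


def utility(sample, cost_weight=0.05):
    # Assumed form (not from the paper): success reward minus a length cost.
    return float(sample.success) - cost_weight * len(sample.text.split())


def preference_pairs(samples, cost_weight=0.05, margin=0.0):
    # Compare every pair of descriptions sampled at the same stage of the
    # interaction; the higher-utility one becomes "chosen", the other "rejected".
    pairs = []
    for a, b in combinations(samples, 2):
        ua, ub = utility(a, cost_weight), utility(b, cost_weight)
        if abs(ua - ub) <= margin:
            continue  # skip ties and near-ties: no clear preference
        chosen, rejected = (a, b) if ua > ub else (b, a)
        pairs.append((chosen.text, rejected.text))
    return pairs


if __name__ == "__main__":
    samples = [
        Sample("the dog with the red frisbee on grass", True),
        Sample("red frisbee dog", True),
        Sample("dog", False),
    ]
    for chosen, rejected in preference_pairs(samples):
        print(f"chosen={chosen!r}  rejected={rejected!r}")
```

Under this assumed utility, a short successful description is preferred over a long successful one, and any successful description of modest length is preferred over a failed one. That matches the intuition that success and cost must be combined: cost alone would favor the unsuccessful one-word message. The resulting (chosen, rejected) pairs are the kind of input a preference-optimization trainer such as IPO consumes in step ③.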