Success and Cost Elicit Convention Formation for Efficient Communication
Saujas Vaduguru, Yilun Hua, Yoav Artzi, Daniel Fried
TL;DR
The paper addresses how to enable large multimodal models to form ad hoc linguistic conventions that make communication more efficient over time. It introduces a framework in which speaker and listener models play simulated repeated reference games, and uses IPO, a preference-based training objective, to optimize a joint success+cost utility without relying on human-produced data. Empirically, models trained with success+cost gradually shorten their utterances while increasing communicative success, and human listeners respond faster to these models, indicating effective convention formation. The work shows that both success and cost are necessary to elicit stable conventions. Open-class word usage increases over time, and conventions form in both the COCO and tangram domains, suggesting broad applicability to interactive AI systems and human-AI communication.
Abstract
Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient; both are necessary to elicit convention formation.
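To make the interplay of success and cost concrete, here is a minimal sketch of how sampled speaker utterances could be ranked by a joint utility and turned into preference pairs for a preference-based objective. It is an illustration under stated assumptions, not the authors' implementation: the `Sample` class, the word-count cost, the linear form `success - cost_weight * cost`, and the value of `cost_weight` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Sample:
    utterance: str  # message sampled from the speaker model
    success: bool   # whether the listener selected the intended target


def utility(sample: Sample, cost_weight: float = 0.05) -> float:
    """Hypothetical joint utility: reward success, penalize length in words."""
    cost = len(sample.utterance.split())
    return float(sample.success) - cost_weight * cost


def preference_pairs(samples, cost_weight=0.05, margin=0.0):
    """Return (preferred, dispreferred) utterance pairs ranked by utility.

    Pairs whose utilities differ by no more than `margin` are skipped,
    since they carry no clear preference signal.
    """
    pairs = []
    for a, b in combinations(samples, 2):
        ua, ub = utility(a, cost_weight), utility(b, cost_weight)
        if abs(ua - ub) <= margin:
            continue
        if ua > ub:
            pairs.append((a.utterance, b.utterance))
        else:
            pairs.append((b.utterance, a.utterance))
    return pairs


# Illustrative (made-up) samples for one round of a reference game.
samples = [
    Sample("the dog on the red couch near the window", True),
    Sample("couch dog", True),
    Sample("dog", False),
]
for preferred, dispreferred in preference_pairs(samples):
    print(f"{preferred!r} > {dispreferred!r}")
```

In this toy setting, setting `cost_weight=0` leaves the long and short successful utterances tied, so no preference for brevity emerges, while scoring on length alone would favor the shortest utterance even though it failed. This is one way to see the intuition behind needing both terms; the pairs produced here would then feed whatever preference-optimization loss is used for training.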
