Efficacy of Language Model Self-Play in Non-Zero-Sum Games
Austen Liao, Nicholas Tomlin, Dan Klein
TL;DR
This work probes whether self-play, a cornerstone of success in adversarial game-playing AI, can also improve language models on cooperative negotiation tasks. Using a tunable Deal or No Deal environment, the authors train models via multiple rounds of filtered behavior cloning under cooperative, semi-competitive, and strictly competitive objectives, and evaluate them both in self-play and with human partners. They find substantial gains under the cooperative and semi-competitive objectives, with improvements that transfer to both collaboration and competition with humans, while the strictly competitive setting yields limited gains and risks overfitting to self-play. The results suggest self-play is a promising direction for language models in cooperative contexts, though robust strategic negotiation and generalization remain open challenges; the authors also release open-source tools and data to support future research.
Abstract
Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve language models. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives and evaluate them in self-play and in collaboration with humans. We find that language models improve substantially in self-play, achieving 14-17x higher scores in task reward after finetuning. Further, the trained models generalize to both cooperation and competition with humans, scoring 2.5-6x higher than base models. We view these results as an early promising sign for language model self-play in cooperative settings, despite a lack of theoretical guarantees.
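The training loop described in the abstract can be sketched in a few lines. This is a toy illustration of filtered behavior cloning in self-play, not the paper's exact recipe: `play_episode`, the reward stub, and the `keep_frac` threshold are all hypothetical stand-ins for the real DoND rollouts, game scores, and filtering rule.

```python
import random

def play_episode(policy, rng):
    # Roll out one toy self-play negotiation; returns (transcript, reward).
    transcript = [policy(t) for t in range(3)]
    reward = rng.random()  # stand-in for the joint or individual game score
    return transcript, reward

def filtered_bc(policy, rounds=3, games=100, keep_frac=0.2, seed=0):
    """Filtered behavior cloning loop (sketch): sample self-play games,
    keep only the highest-reward transcripts, and treat them as
    supervised finetuning data for the next round."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(rounds):
        episodes = [play_episode(policy, rng) for _ in range(games)]
        episodes.sort(key=lambda e: e[1], reverse=True)
        kept = episodes[: int(keep_frac * games)]  # reward filter
        dataset.extend(t for t, _ in kept)
        # In the real pipeline, `policy` would be finetuned on `dataset`
        # here before the next round of self-play.
    return dataset

data = filtered_bc(lambda t: f"turn-{t}")
print(len(data))  # 3 rounds x 20 kept games = 60 transcripts
```

The key design choice the paper varies is the reward being filtered on: a shared score makes the game fully cooperative, an individual score makes it competitive, and mixtures interpolate between the two.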
