Efficacy of Language Model Self-Play in Non-Zero-Sum Games

Austen Liao, Nicholas Tomlin, Dan Klein

TL;DR

This work probes whether self-play, a cornerstone of success in adversarial AI, can improve language models in cooperative negotiation tasks. Using a tunable Deal-or-No-Deal environment, the authors train models via filtered behavior cloning across cooperative, semi-competitive, and strictly competitive objectives, and evaluate them both in self-play and with human partners. They find substantial gains for cooperative and semi-competitive objectives, with improvements transferring to human collaboration and competition, while strictly competitive settings show limited transfer and overfitting risks. The results suggest self-play is a promising direction for language models in cooperative contexts, though achieving robust strategic negotiation and generalization remains an open challenge; the work also provides open-source tools and data to foster future research.

Abstract

Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve language models. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives and evaluate them in self-play and in collaboration with humans. We find that language models improve substantially in self-play, achieving 14-17x higher scores in task reward after finetuning. Further, the trained models generalize to both cooperation and competition with humans, scoring 2.5-6x higher than base models. We view these results as an early promising sign for language model self-play in cooperative settings, despite a lack of theoretical guarantees.
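
The training recipe described in the abstract lends itself to a compact sketch. Below is a minimal, hypothetical illustration of self-play with filtered behavior cloning under a tunable reward; the identifiers (Outcome, play_game, finetune, LAMBDA) and the exact reward parameterization are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of iterated filtered behavior cloning (BC) in self-play.
# All names and the reward parameterization below are illustrative assumptions.
from dataclasses import dataclass

LAMBDA = 1.0  # assumed objective knob: 1 = fully cooperative,
              # 0 = self-interested, -1 = strictly competitive (zero-sum)

@dataclass
class Outcome:
    transcript: str      # full dialogue plus both players' private proposals
    own_points: int      # item points this agent secured for itself
    partner_points: int  # item points the partner secured

def objective(o: Outcome, lam: float = LAMBDA) -> float:
    # Blend self-interest with the partner's score: reward = own + lam * partner.
    # With lam = -1 the two players' rewards sum to zero (strictly competitive);
    # this is one plausible form, not necessarily the paper's exact definition.
    return o.own_points + lam * o.partner_points

def play_game(model) -> Outcome:
    raise NotImplementedError  # placeholder: roll out one self-play dialogue

def finetune(model, transcripts: list[str]):
    raise NotImplementedError  # placeholder: one supervised finetuning step

def filtered_bc_round(model, n_games: int, threshold: float):
    # Sample self-play games, keep only high-reward transcripts, imitate them.
    outcomes = [play_game(model) for _ in range(n_games)]
    kept = [o.transcript for o in outcomes if objective(o) >= threshold]
    return finetune(model, kept)

# Repeated over multiple rounds, each iteration imitates its own best games:
#   for _ in range(n_rounds):
#       model = filtered_bc_round(model, n_games, threshold)
```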

Paper Structure

This paper contains 36 sections, 14 figures, 3 tables, and 1 algorithm.

Figures (14)

  • Figure 1: We ran experiments on a modified version of the Deal or No Deal negotiation game from lewis-etal-2017-deal. In this game, two players are presented with a shared collection of items and private value functions over those items. Players can send messages to each other and then each submit private proposals describing the items they wish to receive. If the proposals are compatible, then the items are scored. In our modified version of the task, players may receive reward based not only on their own item scores, but on the item scores of the other player as well. This modification allows us to convert Deal or No Deal into a cooperative or strictly competitive game. (A toy sketch of this compatibility-and-scoring rule appears after the figure list.)
  • Figure 2: Language model self-play significantly increased model performance in both cooperative and semi-competitive games. Moreover, these results generalized to collaboration and competition with humans, leading to improvements of up to 2.5× and 6× the baseline scores, respectively. We found that human-LM baseline scores were higher in the cooperative setting as humans can help "guide" models to avoid common failure modes.
  • Figure 3: Mean dialogue lengths (left) and aggregate vocabulary sizes (right) for every model iteration, for both semi-competitive and cooperative objectives. Dialogues under the semi-competitive objective progressively shrank in length, while dialogues under the cooperative objective grew significantly longer. Similarly, in the semi-competitive setting, vocabulary size trended downward, but the model maintained and even expanded its vocabulary when trained with the cooperative objective.
  • Figure 4: The rate of hallucinations or otherwise inconsistent messages and proposals declines over the course of self-play finetuning. We report this value as a per-message rate, rather than per-game.
  • Figure 5: System prompt used for the semi-competitive objective. Values in {brackets} are filled in based on the game context (i.e., item counts and private value functions).
  • ...and 9 more figures
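
To make the compatibility-and-scoring rule from Figure 1 concrete, the toy example below checks whether two private proposals can both be satisfied from the shared item pool and, if so, scores each player under their private value function. The item names, counts, and values are invented for illustration and do not come from the paper.

```python
# Toy illustration of Deal or No Deal proposal compatibility and scoring.
# The item pool and value functions are invented, not taken from the paper.
ITEM_POOL = {"book": 3, "hat": 1, "ball": 2}   # shared item counts
VALUES_A  = {"book": 1, "hat": 3, "ball": 2}   # player A's private values
VALUES_B  = {"book": 2, "hat": 2, "ball": 1}   # player B's private values

def compatible(prop_a: dict, prop_b: dict) -> bool:
    # Proposals are compatible iff no item is claimed more times than it exists.
    return all(prop_a.get(item, 0) + prop_b.get(item, 0) <= count
               for item, count in ITEM_POOL.items())

def score(proposal: dict, values: dict) -> int:
    # A player's raw score is the value-weighted count of items they receive.
    return sum(values[item] * count for item, count in proposal.items())

prop_a = {"book": 2, "hat": 1}   # A asks for two books and the hat
prop_b = {"book": 1, "ball": 2}  # B asks for one book and both balls
if compatible(prop_a, prop_b):
    print(score(prop_a, VALUES_A), score(prop_b, VALUES_B))  # -> 5 4
else:
    print("Incompatible proposals: neither player scores")
```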