MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games

Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, Mirella Lapata

TL;DR

In many cases, language models fail to use interactive collaboration to improve over a non-interactive baseline, in which one agent summarizes its information and the other agent acts immediately, even though substantial headroom remains.

Abstract

We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its information and the other agent immediately acts -- despite substantial headroom. This suggests that state-of-the-art models still suffer from significant weaknesses in planning and executing multi-turn collaborative conversations. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence. While there is no single linguistic explanation for the collaborative weaknesses of contemporary language models, we note that humans achieve comparable task success at superior token efficiency by producing dialogues that are more coherent than those produced by most language models. The proactive management of private information is a defining feature of real-world communication, and we hope that MT-PingEval will drive further work towards improving this capability.

Paper Structure (48 sections, 9 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: An example of a successful COVR dialogue from Gemma3-12B.
  • Figure 2: An example of a successful dialogue for GPT-4o on the MD3 image selection task. The guesser sees all six images, while the describer sees only the highlighted image on the lower right.
  • Figure 3: An unsuccessful dialogue for Qwen-VL8B on the Tangram image selection task. The guesser sees all four images, while the describer sees only the highlighted image on the far right. The guesser answers prematurely, instead of using additional turns to narrow down the selection.
  • Figure 4: A successful but lucky dialogue for Gemini 2.5 Flash (thinking) on the name-game.
  • Figure 5: Isotoken evaluation, showing task accuracy with a constant token budget sharded over varying numbers of turns. In most cases, performance is flat or decreasing as the turn budget increases, indicating that the models fail to exploit interactivity.
  • ...and 11 more figures
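
The isotoken evaluation in Figure 5 holds the total token budget constant and spreads it over a varying number of turns. The sketch below illustrates one way such a split could be computed. It assumes an even division of the budget across turns; the paper's actual allocation scheme is not stated in this excerpt, and the names `per_turn_budgets` and `isotoken_grid` are hypothetical.

```python
from typing import Dict, Iterable, List


def per_turn_budgets(total_tokens: int, num_turns: int) -> List[int]:
    """Split a fixed token budget as evenly as possible over num_turns.

    Assumption: an even split. The source does not specify how the
    budget is divided among turns.
    """
    if num_turns < 1:
        raise ValueError("num_turns must be at least 1")
    if total_tokens < num_turns:
        raise ValueError("total_tokens must allow at least one token per turn")
    base, extra = divmod(total_tokens, num_turns)
    # The first `extra` turns receive one additional token so that the
    # per-turn budgets always sum exactly to total_tokens.
    return [base + 1 if i < extra else base for i in range(num_turns)]


def isotoken_grid(total_tokens: int, turn_counts: Iterable[int]) -> Dict[int, List[int]]:
    """Map each turn count to its per-turn budgets under one shared total."""
    return {t: per_turn_budgets(total_tokens, t) for t in turn_counts}


if __name__ == "__main__":
    # Hypothetical budget and turn counts, chosen only for illustration.
    for turns, budgets in isotoken_grid(1024, [1, 2, 4, 8]).items():
        print(turns, budgets)
```

Every configuration in the grid spends the same total number of tokens, so any difference in task accuracy between configurations can be attributed to how the conversation is divided into turns rather than to how much text is produced.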