Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

Yizhe Zhang, Jiarui Lu, Navdeep Jaitly

TL;DR

The paper develops the Entity-Deduction Arena (EDA) as a surrogate benchmark for evaluating LLMs' multi-turn planning and conversational reasoning when deducing an unknown entity. It benchmarks several LLMs (GPT-4, GPT-3.5, Claude, Vicuna, Mistral) on the Things and Celebrities datasets in 20-turn games between a judge and a guesser, using metrics that include number of turns, success, and number of yes responses. Key findings show that strong models like GPT-4 achieve higher success in fewer turns, and that open-source models can close the gap through Behavior Cloning and Reinforcement Learning from game play. The work provides guidance for training autonomous agents to handle ambiguity and shares code and data to accelerate future research.

Abstract

Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning and planning over multiple conversational turns. However, directly measuring this can be challenging. In this paper, we offer a surrogate problem which assesses an LLM's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This entity-deducing game can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of language models. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model is capable of imitating a stronger model and generalizing to data or domains, using only the demonstrations from a stronger model. We finally propose to use Reinforcement Learning to enhance the reasoning and planning capacity of Vicuna models through episodes of game playing, which leads to significant performance improvement. We hope that this problem offers insights into how autonomous agents could be trained to behave more intelligently in ambiguous circumstances.
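
To make the game protocol concrete, here is a minimal Python sketch of one entity-deducing episode and the three metrics named in the TL;DR (success, turns, yes responses). It is an illustration under stated assumptions, not the authors' implementation: the toy entity table, the rule-based `judge`, the split-in-half `next_question` strategy, and the `GameResult` record are all hypothetical stand-ins for the LLM judge and LLM guesser used in the paper. Only the 20-turn budget comes from the text above.

```python
from dataclasses import dataclass

MAX_TURNS = 20  # turn budget stated in the TL;DR

# Hypothetical toy knowledge base (entity -> attributes); not from the paper.
TOY_ENTITIES = {
    "cat": {"animal", "pet", "furry"},
    "shark": {"animal", "aquatic"},
    "goldfish": {"animal", "pet", "aquatic"},
    "bicycle": {"vehicle"},
    "submarine": {"vehicle", "aquatic"},
}


@dataclass
class GameResult:
    success: bool   # did the guesser name the secret entity within the budget?
    turns: int      # number of turns used
    yes_count: int  # number of "Yes" answers received from the judge


def judge(secret, question):
    """Rule-based stand-in for the judge, which knows the secret entity."""
    kind, value = question
    if kind == "guess":
        return "Correct" if value == secret else "No"
    return "Yes" if value in TOY_ENTITIES[secret] else "No"


def next_question(candidates, asked):
    """Stand-in guesser: ask about the attribute that splits candidates most evenly."""
    attrs = sorted({a for c in candidates for a in TOY_ENTITIES[c]} - asked)
    if len(candidates) == 1 or not attrs:
        return ("guess", candidates[0])

    def imbalance(attr):
        yes = sum(attr in TOY_ENTITIES[c] for c in candidates)
        return abs(2 * yes - len(candidates))

    return ("attr", min(attrs, key=imbalance))


def play(secret):
    """Run one episode and return its metrics."""
    candidates = sorted(TOY_ENTITIES)
    asked = set()
    yes_count = 0
    for turn in range(1, MAX_TURNS + 1):
        question = next_question(candidates, asked)
        answer = judge(secret, question)
        if answer == "Correct":
            return GameResult(True, turn, yes_count)
        kind, value = question
        if kind == "guess":
            # Wrong guess: drop it and keep playing.
            candidates = [c for c in candidates if c != value]
        else:
            # State tracking: keep only candidates consistent with the answer.
            asked.add(value)
            keep = answer == "Yes"
            yes_count += int(keep)
            candidates = [c for c in candidates
                          if (value in TOY_ENTITIES[c]) == keep]
        if not candidates:
            break
    return GameResult(False, MAX_TURNS, yes_count)


def summarize(results):
    """Aggregate per-game metrics into a success rate and mean turn count."""
    n = len(results)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "avg_turns": sum(r.turns for r in results) / n,
    }


if __name__ == "__main__":
    results = [play(entity) for entity in TOY_ENTITIES]
    print(summarize(results))
```

In this sketch the guesser's candidate list plays the role of the state tracking mentioned in the abstract, and the choice of the most evenly splitting attribute is a simple form of planning. An LLM guesser has no explicit candidate list and must maintain the equivalent state implicitly across the conversation.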

Paper Structure (39 sections, 2 equations, 4 figures, 14 tables)

Figures (4)

  • Figure 1: The entity-deducing game resembles real scenarios in which the agent must decide strategically, based on the conversation so far, which clarification question to ask in order to elicit the user's actual intent in as few turns as possible.
  • Figure 2: A breakdown of the score of each model on the evaluated items, with the x-axis representing the order of difficulty ranging from easy to difficult. Scores are averaged over 5 repetitions.
  • Figure 3: Composition of the EDA Things and Celebrities datasets.
  • Figure 4: Gameplay UI for collecting the human baseline. On the left, human players are given prompt instructions equivalent to those given to LLM guessers. An optional retrospection UI can be toggled to display what ChatGPT would have chosen to ask in the last turn. On the right, a leaderboard shows human and LLM player performance.