Do Large Language Models have Problem-Solving Capability under Incomplete Information Scenarios?

Yuyan Chen, Tianhao Yu, Yueze Li, Songzhou Yan, Sijia Liu, Jiaqing Liang, Yanghua Xiao

TL;DR

The paper investigates whether large language models can solve problems under incomplete information by introducing BrainKing, a benchmark that fuses elements of "Who is undercover" and "Twenty Questions". It defines three difficulty modes and an automated evaluation pipeline with accuracy- and round-based win rates plus a confusion/rethink signal, and it tests a spectrum of models from GPT-4 to smaller open-source LLMs. Empirical results show GPT-4 leading across modes, with Claude2 as a strong competitor, while smaller models struggle with misleading information and harder starting points. The study provides a scalable framework for assessing incomplete-information processing in LLMs and highlights the importance of deception handling and adaptive questioning for robust problem solving.

Abstract

Evaluating the problem-solving capability of Large Language Models (LLMs) under incomplete information scenarios is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focuses on LLMs' problem-solving capability in games such as "Twenty Questions". However, these kinds of games do not require recognizing misleading cues, which is necessary in incomplete information scenarios. Moreover, existing games such as "Who is undercover" are highly subjective, which makes evaluation challenging. Therefore, in this paper, we introduce a novel game named BrainKing, based on "Who is undercover" and "Twenty Questions", for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with a limited number of yes-or-no questions and potentially misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights into LLM problem-solving levels.
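
To make the game setting in the abstract concrete, the following is a minimal toy sketch of a yes-or-no guessing game in which some answers may be misleading. It is not the BrainKing implementation: the entity set, the attribute-style questions, the `lie_rounds` parameter, and the "fewest contradictions" guessing rule are all assumptions chosen for illustration, and a scripted guesser stands in for an LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A hypothetical target entity described by a set of attributes."""
    name: str
    attributes: frozenset


def answer(target, attribute, lie):
    """Answer "does the target have this attribute?", flipped when lie is True."""
    truthful = attribute in target.attributes
    return (not truthful) if lie else truthful


def play(candidates, target, questions, lie_rounds, max_rounds=20):
    """Ask up to max_rounds yes-or-no questions and return (guess, rounds_used).

    Instead of eliminating a candidate on its first contradiction, the guesser
    counts contradictions per candidate, so a single misleading answer does not
    rule out the true target. This rule is an illustrative assumption.
    """
    mismatches = {c.name: 0 for c in candidates}
    rounds = 0
    for rounds, attribute in enumerate(questions[:max_rounds], start=1):
        reply = answer(target, attribute, lie=rounds in lie_rounds)
        for c in candidates:
            if (attribute in c.attributes) != reply:
                mismatches[c.name] += 1
    # Guess the candidate that contradicts the fewest answers.
    guess = min(mismatches, key=mismatches.get)
    return guess, rounds


# Toy data (assumed, not taken from the benchmark).
CANDIDATES = [
    Entity("cat", frozenset({"animal", "mammal", "pet"})),
    Entity("dog", frozenset({"animal", "mammal", "pet", "barks"})),
    Entity("shark", frozenset({"animal", "fish"})),
    Entity("oak", frozenset({"plant", "tree"})),
]
QUESTIONS = ["animal", "mammal", "pet", "barks", "fish", "tree"]

if __name__ == "__main__":
    # The answer in round 2 ("mammal") is deliberately misleading.
    print(play(CANDIDATES, CANDIDATES[1], QUESTIONS, lie_rounds={2}))
```

In this toy setup, a guesser that discarded every candidate contradicting an answer would drop the true target after the misleading reply in round 2, which is the kind of failure the abstract attributes to games that never require recognizing misleading cues.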

Paper Structure (13 sections, 12 figures, 5 tables)

This paper contains 13 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: A sample of the "Who is undercover" game (a) and the "Twenty Questions" game (b).
  • Figure 2: The overview of the proposed BrainKing benchmark, including three modes.
  • Figure 3: The comparison of the performance (accuracy and rounds) of different LLMs across three modes.
  • Figure 4: The win rate of different LLMs in the proposed BrainKing benchmark.
  • Figure 5: The relationship between accuracy and rounds across three modes of different LLMs.
  • ...and 7 more figures