Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study

Yinghao Li; Haorui Wang; Chao Zhang

TL;DR

Experiments, including trials with GPT-4, indicate that while LLMs possess the foundational abilities required for Minesweeper, they struggle to integrate them into the coherent, multi-step logical reasoning needed to solve it.

Abstract

Large Language Models (LLMs) have shown remarkable proficiency in language understanding and have been successfully applied to a variety of real-world tasks through task-specific fine-tuning or prompt engineering. Despite these advancements, it remains an open question whether LLMs are fundamentally capable of reasoning and planning, or if they primarily rely on recalling and synthesizing information from their training data. In our research, we introduce a novel task -- Minesweeper -- specifically designed in a format unfamiliar to LLMs and absent from their training datasets. This task challenges LLMs to identify the locations of mines based on numerical clues provided by adjacent opened cells. Successfully completing this task requires an understanding of each cell's state, discerning spatial relationships between the clues and mines, and strategizing actions based on logical deductions drawn from the arrangement of the cells. Our experiments, including trials with the advanced GPT-4 model, indicate that while LLMs possess the foundational abilities required for this task, they struggle to integrate these into a coherent, multi-step logical reasoning process needed to solve Minesweeper. These findings highlight the need for further research to understand the nature of reasoning capabilities in LLMs under similar circumstances, and to explore pathways towards more sophisticated AI reasoning and planning models.
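
To make concrete the kind of deduction the abstract refers to, here is a minimal sketch of the two simplest single-clue rules a Minesweeper player applies. It is an illustration only, not the paper's method or board format: the encoding (`?` for an unopened cell, `F` for a flagged cell, digits for opened cells) and the function names `neighbors` and `deduce` are assumptions made for this example.

```python
def neighbors(r, c, rows, cols):
    """Yield the in-bounds coordinates of the up-to-8 cells around (r, c)."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield r + dr, c + dc


def deduce(board):
    """Apply two single-clue rules to a board given as a list of strings.

    Assumed encoding (hypothetical, chosen for this sketch):
    '?' = unopened, 'F' = flagged as a mine, '0'-'8' = opened clue.
    Returns (mines, safe): sets of (row, col) cells that must be mines
    or must be safe according to these two rules alone.
    """
    rows, cols = len(board), len(board[0])
    mines, safe = set(), set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if not cell.isdigit():
                continue
            adjacent = list(neighbors(r, c, rows, cols))
            unopened = [p for p in adjacent if board[p[0]][p[1]] == "?"]
            flagged = sum(board[p[0]][p[1]] == "F" for p in adjacent)
            remaining = int(cell) - flagged  # mines still unaccounted for
            if remaining == 0:
                # Every mine around this clue is already flagged.
                safe.update(unopened)
            elif remaining == len(unopened):
                # Every unopened neighbor must hold a mine.
                mines.update(unopened)
    return mines, safe
```

For example, `deduce(["1?", "11"])` marks cell `(0, 1)` as a mine, because it is the only unopened neighbor of a `1` clue. Solving a full board usually requires chaining such deductions across several clues, which is the multi-step integration the abstract reports LLMs struggling with.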

Paper Structure (19 sections, 5 figures, 4 tables)

Introduction
Related Works
Minesweeper
Board Understanding
Experiment Setup
Results
Minesweeper Gameplay
Experiment Setup
Metrics
Results and Discussion
Objective Scores
Reasoning Chains
Conclusion
Experiment Setup
GPT Versions
...and 4 more sections

Figures (5)

Figure 1: An example of Minesweeper on a $9\times9$ board containing 10 mines, along with its interaction format. Its subfigures show the game's GUI; a plain-text, table-formatted representation of the game board, enhanced with color for improved visualization; the coordinate-based plain-text representation of the board; and a log of the player's (in this case, the first author's) actions, where "L", "R", and "M" denote left-click, right-click, and middle-click actions, respectively.
Figure 2: Interaction prompting modes. The "Natural Conversation" mode encompasses the full interaction history, whereas the "Compact History" mode condenses the actions generated by the LLM and the game's feedback into a succinct, unified prompt.
Figure 3: A detailed analysis of the example interactions performed by GPT-3.5-instruct. Arrows pointing left and right signify left and right mouse clicks, respectively; an arrow pointing upwards represents a middle-click. The majority of actions are technically allowed but do not effectively advance the gameplay.
Figure 4: A case study of a "valid" action and its corresponding reasoning generated by GPT-3.5-16k for solving Minesweeper. Blue indicates logical reasoning; red and gold indicate illogical reasoning.
Figure 5: A case study of the reasoning sequences formulated by GPT-4 with the coordinate board representation during action planning. The GUI on the left shows the board states and actions taken by the agent during the game. Elements highlighted in blue represent accurate facts and logical inferences as assessed by human evaluation, while those marked in red indicate incorrect observations or illogical conclusions. Notably, the final generated action "F(3,1)" deviates from the permissible action formats, resulting in the termination of the game.
