LLM-Cave: A benchmark and light environment for large language models reasoning and decision-making system

Huanyu Li, Zongyuan Li, Wei Huang, Xian Guo

TL;DR

The paper addresses the gap in evaluating multi-step reasoning in LLMs with a lightweight, accessible benchmark. It introduces LLM-Cave and two inference strategies: Chain of Speculation and Planner-Critic. Experiments across models show that structured reasoning improves success rates and rewards, especially for weaker models, albeit with higher compute needs. The work offers a practical benchmark and direction for improving LLM reasoning and decision-making.

Abstract

Large language models (LLMs) such as ChatGPT o1, ChatGPT o3, and DeepSeek-R1 have shown great potential in solving difficult problems. However, current LLM evaluation benchmarks are limited to one-step interactions. Some existing sequential decision-making environments, such as TextStarCraftII and LLM-PySC2, are too complicated and require hours of interaction to complete a game. In this paper, we introduce LLM-Cave, a benchmark and lightweight environment for LLM reasoning and decision-making systems. The environment is a classic problem from the era of symbolic artificial intelligence: the agent must explore the environment and avoid potential losses by reasoning about nearby dangers from partially observable state information. In our experiments, we evaluated the sequential reasoning ability, decision-making performance, and computational efficiency of mainstream LLMs such as GPT-4o-mini, o1-mini, and DeepSeek-R1. The results show that while DeepSeek-R1 achieved the highest success rate on complex reasoning tasks, smaller models like 4o-mini significantly narrowed the performance gap by employing the Chain of Speculation and Planner-Critic strategies, at the cost of computational efficiency. This indicates that structured, multi-step reasoning combined with an LLM-based feedback mechanism can substantially enhance an LLM's decision-making capabilities, providing a promising direction for improving reasoning in weaker models and suggesting a new reasoning-centered benchmark for LLM assessment. Our code is open-sourced at https://github.com/puleya1277/CaveEnv.

Paper Structure

This paper contains 17 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: LLM-Cave Environment. The LLM controls a game agent that must explore the cave and find the gold. The agent observes only its current grid cell, while pits (holes) and the Wumpus (a monster) can kill it. Cells near a pit carry a breeze, and cells near the Wumpus carry a stench. The LLM must infer the positions of the pits and the Wumpus from the observed breeze and stench to explore the cave safely and find the gold.
  • Figure 2: The workflow of LLM agents interacting with their environment. The Chain of Speculation and Planner-Critic mechanisms are applied within LLM-Cave. The Chain of Speculation maintains explicit hypotheses about pit and Wumpus positions, updating them after each observation. The Planner proposes an action, which the Critic scores for safety; actions exceeding a confidence threshold are executed, otherwise a safer alternative is supplied.
  • Figure 3: Visualization of LLM-Cave.
  • Figure 4: System prompt for LLMs.
  • Figure 5: A typical experiment replay. In LLM-Cave, the DeepSeek R1 model effectively guided the agent through a hazardous cave. Starting from a safe region, the agent detected signs of danger—breeze and stench—and inferred the locations of a pit and the Wumpus. It successfully avoided the pit, eliminated the Wumpus, and proceeded to collect the gold, demonstrating accurate reasoning and decision-making under uncertainty.
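
The percept rule in Figure 1 and the threshold gate in Figure 2 can be illustrated with a small sketch. This is not the CaveEnv code: the 4x4 grid size, the orthogonal-adjacency reading of "near", the 0.5 threshold, and every function name below are assumptions made for illustration, and a rule-based scorer stands in for the LLM Critic.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]
SIZE = 4  # assumed grid size; the source does not state one


def neighbors(cell: Cell, size: int = SIZE) -> List[Cell]:
    """Orthogonally adjacent cells inside the grid (assumed meaning of 'near')."""
    x, y = cell
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in candidates if 0 <= a < size and 0 <= b < size]


def percepts(cell: Cell, pits: Set[Cell], wumpus: Cell) -> Dict[str, bool]:
    """Breeze if a pit is adjacent, stench if the Wumpus is adjacent."""
    adj = neighbors(cell)
    return {"breeze": any(n in pits for n in adj), "stench": wumpus in adj}


def safe_cells(observations: Dict[Cell, Dict[str, bool]]) -> Set[Cell]:
    """Hypothesis update: visited cells are safe, and so are all neighbors
    of any visited cell that reported neither breeze nor stench."""
    safe = set(observations)
    for cell, obs in observations.items():
        if not obs["breeze"] and not obs["stench"]:
            safe.update(neighbors(cell))
    return safe


def critic_score(target: Cell, safe: Set[Cell]) -> float:
    """Toy stand-in for the LLM Critic: 1.0 if proven safe, else 0.0."""
    return 1.0 if target in safe else 0.0


def choose_move(proposed: Cell, current: Cell, safe: Set[Cell],
                threshold: float = 0.5) -> Cell:
    """Execute the Planner's proposal only if its score exceeds the threshold;
    otherwise fall back to a proven-safe neighbor, or stay in place."""
    if critic_score(proposed, safe) > threshold:
        return proposed
    fallback = sorted(n for n in neighbors(current) if n in safe)
    return fallback[0] if fallback else current


# Hypothetical layout: one pit and the Wumpus, hidden from the agent.
pits, wumpus = {(2, 0)}, (0, 2)
observations = {c: percepts(c, pits, wumpus) for c in [(0, 0), (1, 0)]}
safe = safe_cells(observations)
print(choose_move(proposed=(2, 0), current=(1, 0), safe=safe))
```

In this layout the agent feels a breeze at (1, 0), so the proposed step onto (2, 0) is not proven safe and the gate substitutes a retreat to a known-safe cell. An LLM Critic would return graded confidence rather than the binary score used here.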