DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, Liquan Xiao

TL;DR

DSGBench tackles the challenge of evaluating LLM-based agents in complex, dynamic strategic tasks by integrating six diverse games (StarCraft II, Civilization, Street Fighter III, Diplomacy, Werewolf, Stratego) with a fine-grained metric suite spanning five cognitive dimensions and an automated decision-trajectory analysis. It formalizes interactions as POMDPs and proposes a weighted, normalized score T to capture multi-dimensional performance across scenarios, enabling systematic comparison and interpretability. Experimental results reveal distinct specialization across models, with closed-source agents generally outperforming open-source counterparts in strategic planning while real-time decision-making and adaptation remain challenging. The framework’s Gym-based implementation, customizable scenarios, and trajectory analysis offer a scalable path for evaluating and guiding the development of more capable, generalizable LLM-based agents in multi-agent settings.
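
To make the idea of a weighted, normalized score concrete, here is a minimal sketch. It is not the paper's definition of T: the min-max normalization, the dimension names, the value ranges, and the weights are all illustrative assumptions, and the function names `min_max_normalize` and `weighted_score` are hypothetical.

```python
from typing import Mapping


def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Map a raw metric onto [0, 1] given an assumed range [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)  # keep out-of-range values inside [lo, hi]
    return (clipped - lo) / (hi - lo)


def weighted_score(normalized: Mapping[str, float],
                   weights: Mapping[str, float]) -> float:
    """Weighted average of normalized per-dimension scores (assumed form)."""
    total_weight = sum(weights[name] for name in normalized)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * normalized[name] for name in normalized)
    return weighted_sum / total_weight


# Hypothetical raw metrics, ranges, and weights, for illustration only.
normalized = {
    "strategic_planning": min_max_normalize(30.0, 0.0, 60.0),        # 0.5
    "real_time_decision_making": min_max_normalize(5.0, 0.0, 20.0),  # 0.25
    "team_collaboration": min_max_normalize(8.0, 0.0, 8.0),          # 1.0
}
weights = {
    "strategic_planning": 2.0,
    "real_time_decision_making": 1.0,
    "team_collaboration": 1.0,
}
score = weighted_score(normalized, weights)
print(score)
```

Normalizing before weighting keeps metrics with large raw ranges from dominating the aggregate, which is one plausible reason for combining the two steps.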

Abstract

Large Language Model (LLM) based agents have become increasingly popular for solving complex and dynamic tasks, which calls for proper evaluation systems to assess their capabilities. Nevertheless, existing benchmarks usually either focus on single-objective tasks or use overly broad metrics, and so fail to provide a comprehensive inspection of the actual capabilities of LLM-based agents in complicated decision-making tasks. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision-making. First, it incorporates six complex strategic games, which serve as ideal testbeds because of their long-term, multi-dimensional decision-making demands and their flexibility in customizing tasks with various difficulty levels or multiple targets. Second, DSGBench employs a fine-grained scoring system that examines decision-making capabilities along five specific dimensions and combines them into a comprehensive assessment. Furthermore, DSGBench incorporates an automated decision-tracking mechanism that enables in-depth analysis of agent behaviour patterns and of changes in their strategies. We demonstrate the advantages of DSGBench by applying it to multiple popular LLM-based agents, and our results suggest that DSGBench provides valuable insights for choosing LLM-based agents as well as for improving their future development. DSGBench is available at https://github.com/DeciBrain-Group/DSGBench.

Paper Structure

This paper contains 58 sections, 5 equations, 6 figures, 19 tables.

Figures (6)

  • Figure 1: The overall framework of DSGBench. The framework consists of (1) a multi-game environment supporting both asynchronous and synchronous interactions; (2) fine-grained capability metrics for strategic planning, real-time decision-making, and team collaboration; and (3) decision trajectory tracking tools that collaboratively analyze agents' decision-making processes. Through observation-to-prompt and response-to-action loops, DSGBench enables systematic evaluation of LLM-based agents in dynamic, multi-agent scenarios.
  • Figure 3: Code Architecture of DSGBench Framework.
  • Figure : (a) Strategic Planning - EER
  • Figure : (b) Real-Time Decision-Making - EPM
  • ...and 1 more figure