SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

Jianzhu Yao, Kevin Wang, Ryan Hsieh, Haisu Zhou, Tianqing Zou, Zerui Cheng, Zhangyang Wang, Pramod Viswanath

TL;DR

SPIN-Bench assesses large language models on long-horizon strategic planning and social reasoning by unifying formal planning (PDDL), competitive board games, cooperative Hanabi, and negotiation-intensive Diplomacy within a single benchmark and arena. The framework systematically varies action spaces, state complexity, and agent count to reveal where contemporary LLMs struggle with deep multi-hop reasoning and social coordination under uncertainty. Core contributions include the construction of a comprehensive benchmark, evaluation metrics spanning rule-based and negotiation-specific dimensions, and extensive experiments showing bottlenecks in planning depth and social intelligence, as well as how negotiation can impact chain-of-thought coherence. The work highlights the need for integrated planning modules and advanced training paradigms to enable robust multi-agent AI and effective human–AI teaming in socially complex environments.

Abstract

Reasoning and strategic behavior in social interactions are hallmarks of intelligence. This form of reasoning is significantly more sophisticated than isolated planning or reasoning tasks in static settings (e.g., math problem solving). In this paper, we present Strategic Planning, Interaction, and Negotiation (SPIN-Bench), a new multi-domain evaluation designed to measure strategic planning and social reasoning. While many existing benchmarks focus on narrow planning or single-agent reasoning, SPIN-Bench combines classical PDDL tasks, competitive board games, cooperative card games, and multi-agent negotiation scenarios in one unified framework. The framework includes both a benchmark and an arena that simulates and evaluates a variety of social settings to test the reasoning and strategic behavior of AI agents. We formulate SPIN-Bench by systematically varying action spaces, state complexity, and the number of interacting agents to simulate a variety of social settings where success depends not only on methodical, step-wise decision making but also on conceptual inference about other (adversarial or cooperative) participants. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short-range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty. We envision SPIN-Bench as a catalyst for future research on robust multi-agent planning, social reasoning, and human–AI teaming. Project Website: https://spinbench.github.io/

Paper Structure

This paper contains 80 sections, 3 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Overview of the Strategic Planning, Interaction, and Negotiation (SPIN-Bench) framework, highlighting its two core components: (1) the Game Agent, which encompasses the LLMs and their adaptive prompting, and (2) the Environment and Evaluation subsystem, which manages game logic, tracks interactions, and quantifies performance.
  • Figure 2: Heatmap displaying the F1 scores across evaluation categories in Diplomacy.
  • Figure 4: Hanabi score distribution by player count (54,977 games).
  • Figure 5: Sample visualization from our PDDL visualizer.
  • Figure 6: Evaluation of LLM performance in retrieving specific states from full-information trajectories. Each dot indicates the average accuracy for an individual task setting, computed over trajectories with lengths ranging from 1 to 50.
  • ...and 5 more figures