Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments

Yang Li; Xing Chen; Yutao Liu; Gege Qi; Yanxian BI; Zizhe Wang; Yunjian Zhang; Yao Zhu

TL;DR

Results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.

Abstract

Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single-shot capability, overlooking the challenges of opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces the Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that assesses LLMs through 1v1 zero-sum competitive interactions, framing reasoning as an iterative, adaptive decision-making process. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and a fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win-loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy-execution gap: while reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.
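The strategy-execution gap described in the abstract can be illustrated with a toy model. The sketch below is not STAR's scoring and uses no data from the paper: the function names, the "plan quality" scalar, the episode length, the tick interval, and both agent profiles are assumptions invented for illustration. It only shows why a slower, stronger planner can lead when every turn waits for it, yet fall behind when the clock keeps running.

```python
import math


def actions_taken(latency_s: float, episode_s: float, tick_s: float) -> int:
    """Decisions an agent can issue in a real-time episode (toy assumption).

    Each decision blocks for `latency_s` seconds of inference, and the
    environment accepts at most one action per `tick_s` seconds.
    """
    period = max(latency_s, tick_s)
    return math.floor(episode_s / period)


def turn_based_score(plan_quality: float, turns: int = 20) -> float:
    """Turn-based toy score: the game waits, so latency does not matter."""
    return plan_quality * turns


def real_time_score(plan_quality: float, latency_s: float,
                    episode_s: float = 60.0, tick_s: float = 0.5) -> float:
    """Real-time toy score: quality only counts for actions issued in time."""
    return plan_quality * actions_taken(latency_s, episode_s, tick_s)


# Hypothetical profiles, chosen only to make the trade-off visible.
reasoner = {"plan_quality": 0.9, "latency_s": 8.0}
fast_model = {"plan_quality": 0.6, "latency_s": 1.0}

for name, agent in (("reasoner", reasoner), ("fast_model", fast_model)):
    print(name,
          turn_based_score(agent["plan_quality"]),
          real_time_score(agent["plan_quality"], agent["latency_s"]))
```

Under these made-up numbers the reasoner scores higher in the turn-based toy and lower in the real-time one, which mirrors the qualitative pattern the abstract reports without reproducing any of its measurements.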

Paper Structure (37 sections, 5 equations, 4 figures, 4 tables)

Introduction
Related Work
Strategic Tactical Agent Reasoning (STAR) Benchmark
Task Formalization
Zero-sum Competitive Task Design
STAR Framework
Extensibility and Customization
Experimental Results
Settings and Metrics
Turn-Based Evaluation
Self-Organization and Protective Rotation
Coordinated Strikes
Terrain Exploitation
Real-Time Evaluation
Visual Perception vs. Reasoning Performance
...and 22 more sections

Figures (4)

Figure 1: Visualization interface of STAR (Strategic Tactical Agent Reasoning Benchmark). STAR is an LLM PvP (Player versus Player) environment with multiple types of maps, evaluating the capability of LLMs to perform iterative reasoning and strategic decision-making in dynamic, zero-sum multi-agent environments.
Figure 2: Overview of the STAR benchmark architecture. The system relies on four decoupled layers designed for high extensibility and interoperability. The Framework Layer serves as the foundational ECS engine, empowering researchers to construct diverse game scenarios in the Environment Layer by reusing core simulation components. Crucially, the Protocol Layer establishes a standardized interface that bridges the Agent Layer with the environment, enabling seamless interconnectivity between heterogeneous agents and any environment built upon the framework.
Figure 3: Trade-off analysis between Spatial Precision (x-axis) and Action Efficiency (y-axis). While VLMs (squares) achieve superior spatial grounding with minimal error, their high inference latency results in low action frequency. Conversely, standard LLMs (circles) sacrifice precision for speed. "Thinking" models (triangles/diamonds) bridge this gap, achieving VLM-level precision through reasoning without the visual processing overhead.
Figure 4: Example JSON Observation (Simplified)
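The Figure 2 caption describes a Protocol Layer that lets heterogeneous agents connect to any environment built on the framework. A minimal sketch of that decoupling idea follows. It is not STAR's actual API: `Action`, `Agent`, `Environment`, `CountdownEnv`, `HoldAgent`, and `run_match` are hypothetical names, and the observation dictionaries are placeholders rather than the schema shown in Figure 4.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Tuple


@dataclass(frozen=True)
class Action:
    """Hypothetical action message; the real STAR schema may differ."""
    command: str


class Agent(Protocol):
    """Anything that maps an observation to an action can play."""
    def act(self, observation: dict) -> Action: ...


class Environment(Protocol):
    """Anything that exposes reset/step can host a match."""
    def reset(self) -> dict: ...
    def step(self, actions: Dict[str, Action]) -> Tuple[dict, bool]: ...


class CountdownEnv:
    """Toy stand-in environment: the episode ends after a fixed step count."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> dict:
        self.steps = 0
        return {"step": self.steps}

    def step(self, actions: Dict[str, Action]) -> Tuple[dict, bool]:
        self.steps += 1
        done = self.steps >= self.max_steps
        observation = {
            "step": self.steps,
            "last_actions": {name: a.command for name, a in actions.items()},
        }
        return observation, done


class HoldAgent:
    """Toy agent that always issues the same command."""

    def act(self, observation: dict) -> Action:
        return Action("hold")


def run_match(env: Environment, agents: Dict[str, Agent]) -> int:
    """Drive one match through the shared interface; return steps played."""
    observation = env.reset()
    done = False
    steps = 0
    while not done:
        # Every agent sees the same observation and answers independently.
        actions = {name: agent.act(observation) for name, agent in agents.items()}
        observation, done = env.step(actions)
        steps += 1
    return steps


print(run_match(CountdownEnv(3), {"red": HoldAgent(), "blue": HoldAgent()}))
```

Because `run_match` depends only on the two interfaces, an LLM-backed agent or a different game scenario could be swapped in without touching the loop, which is the kind of interoperability the caption attributes to the Protocol Layer.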
