LLM-PySC2: Starcraft II learning environment for Large Language Models
Zongyuan Li, Yanan Ni, Runnan Qi, Lumin Jiang, Chang Lu, Xiaojie Xu, Xiangbei Liu, Pengfei Li, Yunzheng Guo, Zhe Ma, Huanyu Li, Hui Wu, Xian Guo, Kuihua Huang, Xuebo Zhang
TL;DR
The paper tackles the challenge of enabling large language models (LLMs) to make decisions in StarCraft II by introducing LLM-PySC2, an environment that exposes the full pysc2 action space, rich multi-modal observations, and a native multi-agent framework. It presents an asynchronous query architecture and task-specific Wiki knowledge integration to support LLM-based planning, learning, and collaboration. Through macro-decision and micro-operation experiments, the authors show that while LLMs have zero-shot decision-making potential, their performance remains inconsistent due to insufficient domain knowledge and hallucinations, necessitating task-aware instructions and deployment-time learning. Overall, LLM-PySC2 provides a scalable platform to probe and advance LLM-based decision-making in highly complex, multi-agent environments, guiding future research toward more robust, knowledge-grounded planning systems.
Abstract
Large language models (LLMs) have demonstrated tremendous potential in intelligent decision-making problems, showing unprecedented capabilities across diverse applications ranging from gaming AI systems to complex strategic planning frameworks. However, the StarCraft II platform, which has been widely adopted for validating decision-making algorithms over the past decade, has not yet provided substantial support for this emerging domain. To address two issues, namely that LLMs cannot interface with the hundreds of actions of the pysc2 backend and that native support for multi-agent (MA) collaboration is lacking, we propose the LLM-PySC2 environment. This is the first environment that offers LLMs the complete pysc2 action space with sufficient multi-modal information and game Wiki knowledge. With an asynchronous query architecture, the environment interacts with LLMs efficiently and maintains a constant latency regardless of the size of the agent population. In the experiments, we evaluated LLMs' decision-making performance in both macro-decision and micro-operation scenarios, using traditional StarCraft II Multi-Agent Challenge (SMAC) tasks and a series of newly proposed tasks. Results indicate that LLMs have the potential to achieve victories in complex scenarios but cannot consistently generate correct decisions, especially in the restored pysc2 action space and MA settings. Without task-relevant instructions, the pre-trained models suffer from issues such as hallucinations and inefficient collaboration. Our findings suggest that StarCraft II remains challenging in the era of large models and that much work is still needed to develop an advanced LLM decision-making system; the proposed LLM-PySC2 environment will support future development of LLM-based decision-making solutions.
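As a rough illustration of why an asynchronous query architecture can keep per-step latency roughly constant as the number of agents grows, the sketch below issues all agents' queries concurrently, so one round costs about as much as the slowest single query rather than the sum of all of them. This is not the LLM-PySC2 implementation: `query_llm`, `query_all`, `timed_round`, the agent names, and the simulated delay are all hypothetical stand-ins, and a real deployment would replace the sleep with a call to an actual LLM API.

```python
import asyncio
import time


async def query_llm(agent_name: str, observation: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for one LLM request (assumption, not a real API).

    The sleep simulates network and inference latency.
    """
    await asyncio.sleep(delay)
    return f"{agent_name}: <action for {observation}>"


async def query_all(observations: dict[str, str], delay: float = 0.2) -> dict[str, str]:
    """Send every agent's query concurrently and collect the replies."""
    names = list(observations)
    replies = await asyncio.gather(
        *(query_llm(name, observations[name], delay) for name in names)
    )
    return dict(zip(names, replies))


def timed_round(n_agents: int, delay: float = 0.2) -> tuple[dict[str, str], float]:
    """Run one decision round for n_agents and return (actions, elapsed seconds)."""
    observations = {f"agent_{i}": f"obs_{i}" for i in range(n_agents)}
    start = time.perf_counter()
    actions = asyncio.run(query_all(observations, delay))
    return actions, time.perf_counter() - start


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, elapsed = timed_round(n)
        # Elapsed time stays close to the single-query delay for every n.
        print(f"{n:>2} agents: {elapsed:.2f}s")
```

With sequential queries, the same round would take roughly `n_agents * delay` seconds, which is the scaling behaviour the asynchronous design is meant to avoid.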
