Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks

Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, Soumith Chintala

TL;DR

This work introduces StarCraft micromanagement tasks as challenging RL benchmarks with large state/action spaces and no straightforward state representations. It proposes a greedy-inference framework to reduce action-space complexity and a novel zero-order gradient exploration method that perturbs the policy's last layer while updating the rest via backpropagation, enabling efficient learning of coordinated multi-unit strategies. A deep state-action scoring network with pooling handles variable unit counts and complex inter-unit relationships. In the experiments, zero-order exploration often outperforms DQN and REINFORCE baselines on several maps and generalizes across related scenarios. The results highlight robust, sample-efficient learning of coordination strategies such as focused fire and overkill management, and point to future work in self-play, richer unit types, and broader applicability of the exploration method.

Abstract

We consider scenarios from the real-time strategy game StarCraft as new benchmarks for reinforcement learning algorithms. We propose micromanagement tasks, which present the problem of the short-term, low-level control of army members during a battle. From a reinforcement learning point of view, these scenarios are challenging because the state-action space is very large, and because there is no obvious feature representation for the state-action evaluation function. We describe our approach to tackle the micromanagement scenarios with deep neural network controllers from raw state features given by the game engine. In addition, we present a heuristic reinforcement learning algorithm which combines direct exploration in the policy space and backpropagation. This algorithm allows for the collection of traces for learning using deterministic policies, which appears much more efficient than, for example, ε-greedy exploration. Experiments show that with this algorithm, we successfully learn non-trivial strategies for scenarios with armies of up to 15 agents, where both Q-learning and REINFORCE struggle.
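The idea of exploring directly in policy space with a deterministic policy can be illustrated with a minimal sketch. This is not the paper's algorithm: it only shows a generic zero-order step on a single weight vector, and it omits the backpropagation through the lower layers that the abstract mentions. All names (`sample_direction`, `greedy_action`, `zero_order_step`, `episode_return`, `delta`, `lr`) are hypothetical and chosen for illustration.

```python
import numpy as np

def sample_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere."""
    u = rng.standard_normal(dim)
    return u / np.linalg.norm(u)

def greedy_action(w, action_features):
    """Deterministic policy: pick the action whose feature vector scores highest."""
    return int(np.argmax(action_features @ w))

def zero_order_step(w, episode_return, rng, delta=0.1, lr=0.05):
    """One episodic update (illustrative assumption, not the paper's exact rule).

    The weights are perturbed once, the same perturbed weights are used for
    the whole episode, and w is then moved along the sampled direction in
    proportion to the return that episode obtained.
    """
    u = sample_direction(w.shape[0], rng)
    ret = episode_return(w + delta * u)  # caller plays one full episode with these weights
    return w + lr * ret * u, ret
```

Here `episode_return` is assumed to be a caller-supplied function that plays one full episode with the given weights (for example by calling `greedy_action` at every step) and returns the total reward. Because the perturbation is fixed for the whole episode, the behaviour within an episode stays deterministic and consistent, in contrast to ε-greedy exploration, which randomizes individual actions.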

Paper Structure

This paper contains 25 sections, 10 equations, 1 figure, 2 tables, 1 algorithm.

Figures (1)

  • Figure 1: Example of the training uncertainty (one standard deviation) over 5 different initializations for DQN (left) and zero-order (right) on the m5v5 scenario.