DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Wentse Chen; Shiyu Huang; Yuan Chiang; Tim Pearce; Wei-Wei Tu; Ting Chen; Jun Zhu

TL;DR

DGPO addresses the challenge of learning multiple high-quality RL strategies by introducing an information-theoretic diversity objective based on a latent code $z$ and a discriminator estimating $p(z|s)$. It casts learning as two constrained optimization problems (first maximizing extrinsic return under a diversity constraint, then maximizing diversity under a performance constraint), solved via probabilistic inference within a shared, on-policy PPO-style network with latent conditioning. Empirical results across MPE, Atari, and StarCraft II show that DGPO achieves competitive rewards while discovering richer, more robust strategy sets, often with better sample efficiency than baselines such as RSPO. The approach yields diverse behavioral strategies without training multiple networks, with potential benefits for robustness, user engagement, and adversarial settings.

Abstract

Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it can often be valuable to learn a diverse set of solutions, for instance, to make an agent's interaction with users more engaging, or to improve the robustness of a policy to an unexpected perturbation. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternately constrains the diversity of the strategies and the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards while discovering more diverse strategies, often with better sample efficiency.
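To make the idea of a discriminator-based intrinsic reward concrete, the following is a minimal sketch and not the paper's exact formulation, which is not given in this summary. It assumes a common mutual-information-style reward, $\log q(z|s) - \log p(z)$, with a uniform prior over a discrete latent code; the function names, the uniform prior, and the toy logits are all illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def intrinsic_reward(disc_logits, z, n_latents):
    """Assumed stand-in reward: log q(z|s) - log p(z), with p(z) uniform.

    disc_logits: (batch, n_latents) discriminator outputs for each state s.
    z:           (batch,) integer latent code the policy was conditioned on.
    """
    probs = softmax(disc_logits)
    # Probability the discriminator assigns to the latent code actually used.
    log_q = np.log(probs[np.arange(len(z)), z] + 1e-8)
    # Subtracting log(1 / n_latents) is the same as adding log(n_latents).
    return log_q + np.log(n_latents)

# Toy batch: the first state clearly identifies z = 0,
# the second state carries no information about z.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
z = np.array([0, 1])
rewards = intrinsic_reward(logits, z, n_latents=3)
```

Under this assumed form, a state from which the discriminator can recover the latent code earns a positive reward, while an uninformative state earns a reward near zero, which pushes strategies conditioned on different codes toward visiting distinguishable states.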

Paper Structure (25 sections, 23 equations, 11 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Reinforcement Learning as Probabilistic Graphical Model
Diversity in Reinforcement Learning
Preliminaries
Methodology
Diversity Measurement
Stage 1: Diversity-Constrained Optimization
Stage 2: Extrinsic-Reward-Constrained Optimization
Diversity-Guided Policy Optimization
Experiments
Multi-Agent Particle Environment
Atari
StarCraft II
Ablation Study
...and 10 more sections

Figures (11)

Figure 1: (a) The graphical model of MDPs. (b) The graphical model of diverse MDPs. Grey nodes are observed, and white nodes are hidden. As introduced in Levine (2018), $\mathcal{O}_t$ is a binary random variable, where $\mathcal{O}_t=1$ denotes that the action is optimal at time $t$, and $\mathcal{O}_t=0$ denotes that the action is not optimal.
Figure 2: The overall framework of the DGPO algorithm. Top illustrates the way of calculating $r^{total}_t$, where $mask_r=\mathds{I}[J(\theta) \geq R_{target}]$ and $mask_d=\mathds{I}[J_{Div}(\theta)\geq\delta]$. Center shows the network structure and the data flow of the DGPO algorithm. Bottom shows the latent variable sampling process.
Figure 3: Experimental results in two MPE scenarios, Spread (easy) and Spread (hard), each with multiple optimal solutions. (a) Extrinsic reward versus the diversity of the set of discovered strategies. Positions in the upper-right corner are preferred, and DGPO is located there. (b) The point in training at which each optimal strategy is discovered. Only DGPO and RSPO find all the solutions, and DGPO converges over $1.7\times$ and $15\times$ faster than RSPO in the Spread (easy) and Spread (hard) scenarios, respectively.
Figure 4: Plots showing extrinsic reward performance vs. the diversity of the set of discovered strategies. (a) In two Atari games. (b) In two SMAC scenarios.
Figure 5: The initial state of Spread (easy) and Spread (hard). In both scenarios, agents (orange dots) aim to reach one of the destinations (blue dots). We highlight the optimal solutions with arrows of different colors.
...and 6 more figures
