Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives

Taeho Lee; Donghwan Lee

Abstract

Reinforcement learning (RL) has achieved remarkable success in a wide range of control and decision-making tasks. However, RL agents often exhibit unstable or degraded performance when deployed in environments subject to unexpected external disturbances and model uncertainties. Consequently, ensuring reliable performance under such conditions remains a critical challenge. In this paper, we propose minimax deep deterministic policy gradient (MMDDPG), a framework for learning disturbance-resilient policies in continuous control tasks. The training process is formulated as a minimax optimization problem between a user policy and an adversarial disturbance policy. In this problem, the user learns a robust policy that minimizes the objective function, while the adversary generates disturbances that maximize it. To stabilize this interaction, we introduce a fractional objective that balances task performance and disturbance magnitude. This objective prevents excessively aggressive disturbances and promotes robust learning. Experimental evaluations in MuJoCo environments demonstrate that the proposed MMDDPG achieves significantly improved robustness against both external force perturbations and model parameter variations.
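
The minimax structure described above can be written compactly. This is an illustrative sketch only: the abstract does not state the exact form of the fractional objective, so the numerator and denominator below (expected discounted cost over expected discounted disturbance energy, with an assumed discount factor $\gamma$ and an assumed squared norm) are hypothetical choices that reuse the notation $c_{t+1}$, $w_t$, and $J^{\pi_\theta,\mu_\phi}$ from the Figure 1 caption.

$$
\min_{\theta}\,\max_{\phi}\; J^{\pi_\theta,\mu_\phi},
\qquad
J^{\pi_\theta,\mu_\phi} \;=\;
\frac{\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c_{t+1}\right]}
     {\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \lVert w_t \rVert^{2}\right]}
\quad \text{(assumed form)}
$$

Under a ratio of this kind, a larger disturbance raises the denominator as well as the numerator, so the adversary gains only when the cost grows faster than the disturbance term. This is one way to read the claim that the fractional objective discourages excessively aggressive disturbances.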

Paper Structure (24 sections, 38 equations, 4 figures, 2 tables, 3 algorithms)

Introduction
Related Works
Preliminaries
Two-player zero-sum Markov game
Fractional robust objective
Actor and critic update
Minimax deep deterministic policy gradient
Actor update
Critic update
Exploration
Experiment and results
Experiment setup
Robustness against external disturbances
Robustness to model parameter variations
Results
...and 9 more sections

Figures (4)

Figure 1: Overview of the minimax deep deterministic policy gradient (MMDDPG). Two players, the user and the adversary, interact in the environment, generating the action $a_t$ and the disturbance $w_t$ from the state $s_t$. The action-value functions $Q_{\psi_1}(s,a,w)$ and $Q_{\psi_2}(s,w)$ are updated using the cost $c_{t+1}$ and the disturbance $w_t$. The user policy $\pi_\theta$ is updated to minimize the fractional objective function $J^{\pi_\theta,\mu_\phi}$, while the adversary policy $\mu_\phi$ is updated to maximize it.
Figure 2: Mean and standard deviation of cumulative discounted costs across ten random seeds under random Gaussian disturbances. Error bars indicate one standard deviation. Each row corresponds to a different algorithm: MMDDPG (minimax deep deterministic policy gradient), DDPG (deep deterministic policy gradient), RARL (robust adversarial reinforcement learning), PR-DDPG (probabilistic action-robust DDPG), and NR-DDPG (noisy action-robust DDPG). While the baseline methods exhibit increased cost and variance as task complexity grows, MMDDPG consistently achieves the lowest average cost with minimal variance across both environments.
Figure 3: Performance heatmaps under model parameter uncertainties in the Reacher (top) and Pusher (bottom) environments. The x-axis and y-axis represent the gear scale and the joint damping scale, respectively. Darker colors indicate lower mean discounted costs. Each row corresponds to a different algorithm: MMDDPG (minimax deep deterministic policy gradient), DDPG (deep deterministic policy gradient), RARL (robust adversarial reinforcement learning), PR-DDPG (probabilistic action-robust DDPG), and NR-DDPG (noisy action-robust DDPG). MMDDPG maintains a consistently low-cost region across the entire parameter grid, demonstrating greater robustness to parametric mismatches than the adversarial and action-robust baselines.
Figure 4: Experimental environments. Left: Reacher. Right: Pusher.
