Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives

Taeho Lee, Donghwan Lee

Abstract

Reinforcement learning (RL) has achieved remarkable success in a wide range of control and decision-making tasks. However, RL agents often exhibit unstable or degraded performance when deployed in environments subject to unexpected external disturbances and model uncertainties. Consequently, ensuring reliable performance under such conditions remains a critical challenge. In this paper, we propose minimax deep deterministic policy gradient (MMDDPG), a framework for learning disturbance-resilient policies in continuous control tasks. The training process is formulated as a minimax optimization problem between a user policy and an adversarial disturbance policy. In this problem, the user learns a robust policy that minimizes the objective function, while the adversary generates disturbances that maximize it. To stabilize this interaction, we introduce a fractional objective that balances task performance and disturbance magnitude. This objective prevents excessively aggressive disturbances and promotes robust learning. Experimental evaluations in MuJoCo environments demonstrate that the proposed MMDDPG achieves significantly improved robustness against both external force perturbations and model parameter variations.
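
The abstract does not state the exact form of the fractional objective. As a minimal sketch, assuming it is a ratio of discounted task cost to discounted disturbance energy (a reading suggested by the two critics named in the Figure 1 caption, one driven by the cost and one by the disturbance), the balancing effect can be illustrated as follows. The function names, the squared-norm energy, and the `eps` regularizer are illustrative assumptions, not the authors' definitions.

```python
def discounted_sum(values, gamma):
    """Return sum_t gamma^t * values[t]."""
    total = 0.0
    for t, v in enumerate(values):
        total += (gamma ** t) * v
    return total


def fractional_objective(costs, disturbances, gamma=0.99, eps=1e-8):
    """Assumed form: discounted cost divided by discounted disturbance energy.

    costs:        list of per-step scalar costs
    disturbances: list of per-step disturbance vectors (lists of floats)
    eps:          small constant (hypothetical) to avoid division by zero
    """
    # Per-step disturbance energy, taken here as the squared Euclidean norm.
    energy = [sum(x * x for x in w) for w in disturbances]
    return discounted_sum(costs, gamma) / (discounted_sum(energy, gamma) + eps)


# Same costs, but the second adversary uses disturbances twice as large.
costs = [1.0, 1.0]
mild = fractional_objective(costs, [[1.0], [1.0]], gamma=0.5)
aggressive = fractional_objective(costs, [[2.0], [2.0]], gamma=0.5)
```

Under this assumed ratio, an adversary that doubles its disturbance without raising the cost lowers the value it is trying to maximize, which is one way a fractional objective can discourage excessively aggressive disturbances.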

Paper Structure (24 sections, 38 equations, 4 figures, 2 tables, 3 algorithms)

This paper contains 24 sections, 38 equations, 4 figures, 2 tables, 3 algorithms.

Figures (4)

  • Figure 1: Overview of the minimax deep deterministic policy gradient (MMDDPG). Two players, the user and the adversary, interact in the environment, generating the action $a_t$ and the disturbance $w_t$ according to the state $s_t$. The action-value functions $Q_{\psi_1}(s,a,w)$ and $Q_{\psi_2}(s,w)$ are updated using the cost $c_{t+1}$ and the disturbance $w_t$. The user policy $\pi_\theta$ is updated to minimize the fractional objective function $J^{\pi_\theta,\mu_\phi}$, while the adversary policy $\mu_\phi$ is updated to maximize it.
  • Figure 2: Mean and standard deviation of cumulative discounted costs across ten random seeds under random Gaussian disturbances. Error bars indicate one standard deviation. Each row corresponds to a different algorithm: MMDDPG (minimax deep deterministic policy gradient), DDPG (deep deterministic policy gradient), RARL (robust adversarial reinforcement learning), PR-DDPG (probabilistic action-robust DDPG), and NR-DDPG (noisy action-robust DDPG). While the baseline methods exhibit increased cost and variance as task complexity grows, MMDDPG consistently achieves the lowest average cost with minimal variance across both environments.
  • Figure 3: Performance heatmaps under model parameter uncertainties in the Reacher (top) and Pusher (bottom) environments. The x-axis and y-axis represent the gear scale and joint damping scale, respectively. Darker colors indicate lower mean discounted costs. Each row corresponds to a different algorithm: MMDDPG (minimax deep deterministic policy gradient), DDPG (deep deterministic policy gradient), RARL (robust adversarial reinforcement learning), PR-DDPG (probabilistic action-robust DDPG), and NR-DDPG (noisy action-robust DDPG). MMDDPG maintains a consistently low-cost region across the entire parameter grid, demonstrating superior robustness to parametric mismatches compared to adversarial and action-robust baselines.
  • Figure 4: Experiment environments. Left: Reacher. Right: Pusher.