MAGICS: Adversarial RL with Minimax Actors Guided by Implicit Critic Stackelberg for Convergent Neural Synthesis of Robot Safety

Justin Wang; Haimin Hu; Duy Phuong Nguyen; Jaime Fernández Fisac

TL;DR

The paper addresses the lack of convergence guarantees in neural safety synthesis for high-dimensional robots by introducing MAGICS, a three-player Stackelberg–minimax RL algorithm with an implicit critic. By casting training as a Stackelberg game and applying a discounted Isaacs formulation for reach–avoid safety, the authors prove local convergence of the learning dynamics to a differential Stackelberg equilibrium and extend these results to MAGICS-Safety for high-dimensional systems. Empirically, MAGICS outperforms state-of-the-art neural safety methods in OpenAI Gym tasks and in a hardware experiment with a 36-dimensional quadruped, demonstrating robust safety under adversarial disturbances. The result is a scalable neural safety synthesis framework with local convergence guarantees.
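To make the phrase "discounted Isaacs formulation" concrete, here is a minimal tabular sketch of one common discounted safety backup from the Hamilton-Jacobi RL literature, V(s) = (1 - γ) g(s) + γ min{ g(s), max_u min_d V(s') }. This is an assumption, not necessarily the paper's exact equation: it is the safety-only (avoid) form and omits the target margin that a reach–avoid formulation adds. The 1-D grid world, the margin function `margin`, and all constants are hypothetical and chosen only for illustration.

```python
# Hypothetical 1-D grid: states 0..N-1, with the two end states as failures.
N = 11
GAMMA = 0.9
CONTROLS = (-1, 0, 1)       # controller moves (maximizer)
DISTURBANCES = (-1, 0, 1)   # adversary moves (minimizer)


def margin(s):
    """Safety margin g(s): negative at the end states, positive inside."""
    return min(s, N - 1 - s) - 0.5


def step(s, u, d):
    """Deterministic dynamics, clipped to the grid."""
    return max(0, min(N - 1, s + u + d))


def backup(V):
    """One sweep of the discounted safety backup (assumed form, see lead-in)."""
    new_V = []
    for s in range(N):
        g = margin(s)
        # Controller commits to u first, then the disturbance picks the worst d.
        best_worst_case = max(
            min(V[step(s, u, d)] for d in DISTURBANCES) for u in CONTROLS
        )
        new_V.append((1 - GAMMA) * g + GAMMA * min(g, best_worst_case))
    return new_V


# Value iteration, initialized at the safety margin.
V = [margin(s) for s in range(N)]
for _ in range(200):
    V = backup(V)

# States with positive value are those the controller can keep safe.
safe_states = [s for s in range(N) if V[s] > 0]
print(safe_states)
```

In this toy the controller can always cancel the disturbance, so every non-failure state ends up with positive value. Deep RL methods replace the table `V` with a learned critic and the inner max and min with actor networks.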

Abstract

While robust optimal control theory provides a rigorous framework to compute robot control policies that are provably safe, it struggles to scale to high-dimensional problems, leading to increased use of deep learning for tractable synthesis of robot safety. Unfortunately, existing neural safety synthesis methods often lack convergence guarantees and solution interpretability. In this paper, we present Minimax Actors Guided by Implicit Critic Stackelberg (MAGICS), a novel adversarial reinforcement learning (RL) algorithm that guarantees local convergence to a minimax equilibrium solution. We then build on this approach to provide local convergence guarantees for a general deep RL-based robot safety synthesis algorithm. Through both simulation studies on OpenAI Gym environments and hardware experiments with a 36-dimensional quadruped robot, we show that MAGICS can yield robust control policies outperforming the state-of-the-art neural safety synthesis methods.
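The convergence claim concerns Stackelberg-style learning dynamics, in which the leader differentiates through the follower's response. As a minimal illustration, and not the paper's algorithm, the sketch below compares simultaneous gradient descent-ascent with a Stackelberg update on a hypothetical scalar quadratic game. The game `f`, the step size, and the iteration count are assumptions chosen so that the origin is a strict local minimax point but not a Nash equilibrium.

```python
# Hypothetical zero-sum game, for illustration only:
#   f(x, y) = -0.5*x**2 + 2*x*y - 0.5*y**2
# The leader x minimizes f; the follower y maximizes f.
# At (0, 0): f_yy = -1 < 0 and the Schur complement
# f_xx - f_xy * f_yx / f_yy = -1 + 4 = 3 > 0, so the origin is a strict
# local minimax point, yet f_xx = -1 < 0 means it is not a Nash equilibrium.

F_XY = 2.0   # mixed second derivative of f
F_YY = -1.0  # second derivative of f with respect to y


def grad_x(x, y):
    return -x + 2.0 * y


def grad_y(x, y):
    return 2.0 * x - y


def simultaneous_step(x, y, lr):
    """Plain gradient descent-ascent: each player uses its partial gradient."""
    return x - lr * grad_x(x, y), y + lr * grad_y(x, y)


def stackelberg_step(x, y, lr):
    """Leader uses the total derivative through the follower's best response."""
    total_grad_x = grad_x(x, y) - F_XY * grad_y(x, y) / F_YY
    return x - lr * total_grad_x, y + lr * grad_y(x, y)


def run(step, x0=1.0, y0=1.0, lr=0.05, iters=500):
    """Iterate an update rule and return the final distance to the origin."""
    x, y = x0, y0
    for _ in range(iters):
        x, y = step(x, y, lr)
    return (x * x + y * y) ** 0.5


gda_dist = run(simultaneous_step)
stackelberg_dist = run(stackelberg_step)
print(f"simultaneous GDA distance: {gda_dist:.3f}")
print(f"Stackelberg distance:      {stackelberg_dist:.3e}")
```

On this toy game the simultaneous update spirals away from the origin, while the Stackelberg update contracts toward it. In a neural setting the second-derivative terms are not available in closed form and would have to be computed or approximated, for example with automatic differentiation.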

Paper Structure (10 sections, 6 theorems, 21 equations, 5 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Preliminaries and Problem Formulation
Approach: Stackelberg--Minimax Adversarial RL
Convergent Neural Synthesis of Robot Safety
Experiments
Simulated Examples: Robust Control in OpenAI Gym
Hardware Demonstration: Safe Quadrupedal Locomotion
Limitations and Future Work
Conclusions

Key Result

Theorem

Given a Markov game with actor parameters $(\theta, \psi)$ and shared critic parameters $\omega$, if the critic has the objective function $L(\omega, \theta, \psi)$ defined in eq:a2c_critic, then $\nabla_{\theta} L(\omega, \theta, \psi)$ is given by

Figures (5)

Figure 1: Compared to a non-game robust baseline, our proposed game-theoretic adversarial RL algorithm yields a control policy that is consistently more robust when applied to safe quadrupedal locomotion and stress-tested with varying tugging forces.
Figure 2: The Pendulum and Half Cheetah environments in OpenAI Gym (Brockman et al., 2016), with control and disturbance actions shown as blue and red arrows, respectively. The Pendulum's control inputs are one-dimensional torques applied to the end of the rod in opposition to each other. The Half Cheetah has a six-dimensional control input on the annotated joints, with the disturbance acting to destabilize the cheetah through additional torques on its paws.
Figure 3: Snapshots of the Half Cheetah controlled by the MAGICS policy and the baseline policy. Despite an excessively large disturbance torque, the MAGICS policy managed to flip the robot back upright and resume a normal gait. In contrast, the baseline policy was unable to recover the robot from the overturn; it moved awkwardly on its face and back, wiggling its feet.
Figure 4: Cumulative reward curves across five seeds of MAGICS (blue) and the baseline (orange) for the adversarial Half Cheetah environment. MAGICS converges to an equilibrium that outperforms the converged baseline equilibrium by a factor of $\sim 2.7$. Dashed lines represent exploiter disturbances trained against the controller of the same color.
Figure 5: Time evolution of the human's tugging forces (disturbance) with MAGICS-Safety and the baseline. Both policies are trained in simulation with a maximum tugging force disturbance of 50 N. The MAGICS-Safety policy is robust against varying tugging forces from different angles, while the baseline failed even with tugging forces of smaller magnitude.

Theorems & Definitions (11)

Definition: Local Stackelberg Equilibrium
Definition: Strict Local Minimax Equilibrium [fiez2020implicit, fiez2021global, jin2020local]
Remark
Theorem
Lemma: Robust Stability of DSE under MAGICS
Proof
Lemma: Instability of Spurious Critical Points under MAGICS
Theorem: Convergence of MAGICS
Proof
Proposition
...and 1 more
