Wasserstein Adaptive Value Estimation for Actor-Critic Reinforcement Learning

Ali Baheri; Zahra Shahrooei; Chirayu Salgarkar

TL;DR

WAVE addresses instability in actor-critic reinforcement learning by adding a Sinkhorn-based adaptive Wasserstein regularization term to the critic loss, penalizing large shifts between consecutive Q-value distributions. The adaptive weight $\lambda_k$ is updated from performance signals, yielding a convergence rate of $O(1/k)$ for the critic's mean squared error and a parameter convergence rate of $O(1/\sqrt{k})$, along with an enhanced contraction factor $\gamma_\lambda = \gamma(1 - c\lambda)$. Theoretical results establish stability and accelerated convergence, while empirical tests on continuous-control tasks show superior performance over standard actor-critic methods. This work bridges optimal transport and RL to improve learning stability and robustness in high-dimensional control problems.
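
To get a feel for what the contraction factor $\gamma_\lambda = \gamma(1 - c\lambda)$ implies, the short Python sketch below compares how many iterations a geometrically decaying error would need to fall below a tolerance under $\gamma$ and under $\gamma_\lambda$. The function names, the numeric values of $\gamma$, $c$, and $\lambda$, and the assumption that the error shrinks by exactly the contraction factor each iteration are illustrative choices, not taken from the paper.

```python
import math


def contraction_factor(gamma: float, c: float, lam: float) -> float:
    """Return gamma_lambda = gamma * (1 - c * lam), the factor quoted in the TL;DR."""
    if not 0.0 <= c * lam < 1.0:
        # Assumption of this sketch: keeps the factor in (0, gamma].
        raise ValueError("expected 0 <= c * lam < 1")
    return gamma * (1.0 - c * lam)


def iterations_to_shrink(factor: float, target: float) -> int:
    """Smallest k with factor**k <= target, assuming geometric error decay."""
    return math.ceil(math.log(target) / math.log(factor))


gamma, c, lam = 0.99, 1.0, 0.1  # illustrative values, not from the paper
gamma_lam = contraction_factor(gamma, c, lam)
k_plain = iterations_to_shrink(gamma, 0.01)            # unregularized factor
k_regularized = iterations_to_shrink(gamma_lam, 0.01)  # enhanced factor
print(gamma_lam, k_plain, k_regularized)
```

Under these assumed values, even a modest $c\lambda$ reduces the iteration count by roughly an order of magnitude, which is the intuition behind calling the factor "enhanced".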

Abstract

We present Wasserstein Adaptive Value Estimation for Actor-Critic (WAVE), an approach to enhance stability in deep reinforcement learning through adaptive Wasserstein regularization. Our method addresses the inherent instability of actor-critic algorithms by incorporating an adaptively weighted Wasserstein regularization term into the critic's loss function. We prove that WAVE achieves an $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate for the critic's mean squared error and provide theoretical guarantees for stability through Wasserstein-based regularization. Using the Sinkhorn approximation for computational efficiency, our approach automatically adjusts the regularization weight based on the agent's performance. Theoretical analysis and experimental results demonstrate that WAVE achieves superior performance compared to standard actor-critic methods.
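
As a rough illustration of the idea described in the abstract, the following pure-Python sketch adds a Sinkhorn-approximated Wasserstein penalty between the previous and current batches of Q-values to a mean-squared critic loss. The exact loss, cost function, and update rule for the weight are not given on this page, so everything here is an assumption: the squared-difference ground cost, uniform sample weights, the entropic parameter `epsilon`, the iteration count, and the function names `sinkhorn_distance` and `regularized_critic_loss` are all hypothetical.

```python
import math


def sinkhorn_distance(xs, ys, epsilon=0.1, n_iters=200):
    """Entropic-OT (Sinkhorn) transport cost between two 1-D sample sets.

    Assumptions: uniform weights, squared-difference cost, and plain
    (not log-stabilized) scaling updates, so inputs should be modest in range.
    """
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-cij / epsilon) for cij in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport cost <P, C> with plan P[i][j] = u[i] * K[i][j] * v[j].
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


def regularized_critic_loss(q_pred, td_targets, q_prev, lam):
    """Mean-squared TD error plus lam times the Sinkhorn penalty (hypothetical form)."""
    mse = sum((q - t) ** 2 for q, t in zip(q_pred, td_targets)) / len(q_pred)
    return mse + lam * sinkhorn_distance(q_prev, q_pred)


q_prev = [0.0, 1.0]      # Q-values from the previous critic (made-up numbers)
q_pred = [0.5, 1.5]      # Q-values from the current critic
td_targets = [0.5, 1.5]  # targets chosen so the MSE term is zero
print(regularized_critic_loss(q_pred, td_targets, q_prev, lam=0.5))
```

In this toy setup the TD error is zero, so the loss consists only of the penalty for moving the Q-value distribution away from the previous one. A larger `lam` therefore discourages abrupt shifts between consecutive critics, which is the stabilizing effect the abstract describes.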

Paper Structure (7 sections, 36 equations, 2 figures, 1 algorithm)

Introduction
Methodology
Main Theoretical Results
Numerical Results
Proof of Theorems
Discussion
Conclusion

Figures (2)

Figure 1: Comparison of cumulative average rewards between WAVE and baseline across three continuous control environments: Inverted Pendulum (left), Acrobot (middle), and 2D Robot Navigation (right).
Figure 2: Evolution of the adaptive regularization parameter during training on the Inverted Pendulum environment.
