
Optimal Transport-Assisted Risk-Sensitive Q-Learning

Zahra Shahrooei, Ali Baheri

TL;DR

We present a risk-sensitive Q-learning algorithm that uses optimal transport theory to enhance agent safety and that converges to a stable policy faster than the traditional Q-learning algorithm.

Abstract

Reinforcement learning typically aims to develop decision-making policies that prioritize optimal performance without considering risk or safety. In contrast, safe reinforcement learning aims to mitigate or avoid unsafe states. This paper presents a risk-sensitive Q-learning algorithm that leverages optimal transport theory to enhance agent safety. By integrating optimal transport into the Q-learning framework, our approach optimizes the policy's expected return while minimizing the Wasserstein distance between the policy's stationary distribution and a predefined risk distribution, which encapsulates safety preferences from domain experts. We validate the proposed algorithm in a Gridworld environment. The results indicate that, compared to the traditional Q-learning algorithm, our method significantly reduces the frequency of visits to risky states and converges faster to a stable policy.
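To make the distance term in the abstract concrete, the following is a minimal sketch of how a Wasserstein-1 distance between an empirical state-visitation distribution and a reference distribution might be computed. It is an illustration under stated assumptions, not the paper's implementation: the states are assumed to lie on a one-dimensional chain with unit spacing (a real Gridworld is two-dimensional and would need a ground cost over grid cells), and the names `wasserstein_1d`, `visitation_distribution`, `reference`, `near_visits`, and `far_visits` are hypothetical.

```python
from itertools import accumulate


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on states 0..n-1.

    Assumes unit spacing between neighboring states (an illustrative choice).
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # In one dimension, W1 equals the area between the two CDFs.
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def visitation_distribution(visits, n_states):
    """Empirical state-visitation frequencies from a list of visited states."""
    if not visits:
        raise ValueError("need at least one visited state")
    counts = [0] * n_states
    for state in visits:
        counts[state] += 1
    return [c / len(visits) for c in counts]


# Hypothetical 5-state chain; the reference distribution puts all mass on state 4.
reference = [0.0, 0.0, 0.0, 0.0, 1.0]
near_visits = [3, 4, 4, 3, 4]  # trajectory that stays close to the reference mass
far_visits = [0, 1, 0, 1, 2]   # trajectory that stays far from it

d_near = wasserstein_1d(visitation_distribution(near_visits, 5), reference)
d_far = wasserstein_1d(visitation_distribution(far_visits, 5), reference)
```

Here `d_near` comes out smaller than `d_far`, since the first trajectory's visitation mass sits closer to where the reference distribution places its mass. How such a distance enters the Q-learning update is not specified in the abstract, so that step is deliberately left out of this sketch.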

Paper Structure

This paper contains 8 sections, 5 equations, 4 figures, and 1 algorithm.

Figures (4)

  • Figure 1: Gridworld environment.
  • Figure 2: Average return values across $500$ episodes for $5$ random seeds.
  • Figure 3: Average episode length for $5$ random seeds.
  • Figure 4: Number of obstacle collisions over $500$ episodes for $5$ random seeds.