Distributionally Robust Deep Q-Learning

Chung I Lu, Julian Sester, Aijia Zhang

TL;DR

The paper addresses learning robust policies for continuous-state MDPs under transition-model uncertainty. It introduces a distributionally robust Q-learning framework that uses Sinkhorn-distance-based ambiguity sets around a reference measure and solves the robust Bellman equation via dualisation. By parameterising the robust Q-function with neural networks, the authors derive Robust DQN (RDQN), a practical algorithm that modifies the target computation and training objective to optimise for worst-case state transitions. Theoretical results establish the dynamic programming principle and the existence of solutions for compact state spaces, and empirical studies on a toy gambling task and S&P 500 portfolio optimisation illustrate improved tail performance and risk-adjusted returns under distributional shifts. The work offers a tractable approach to robust reinforcement learning that explicitly accounts for model misspecification in continuous-state settings, with implications for finance and other risk-sensitive domains.

Abstract

We propose a novel distributionally robust $Q$-learning algorithm for the non-tabular case accounting for continuous state spaces where the state transition of the underlying Markov decision process is subject to model uncertainty. The uncertainty is taken into account by considering the worst-case transition from a ball around a reference probability measure. To determine the optimal policy under the worst-case state transition, we solve the associated non-linear Bellman equation by dualising and regularising the Bellman operator with the Sinkhorn distance, which is then parameterised with deep neural networks. This approach allows us to modify the Deep Q-Network algorithm to optimise for the worst-case state transition. We illustrate the tractability and effectiveness of our approach through several applications, including a portfolio optimisation task based on S&P 500 data.
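To make the idea of a dualised worst-case target concrete, here is a minimal sketch. It is not the paper's Proposition 2.7. It uses a generic entropic dual of the form $\sup_{\lambda \geq 0} \{-\lambda\varepsilon - \lambda\delta \log \mathbb{E}[\exp(-V/(\lambda\delta))]\}$, which is the shape such duals commonly take, and the exact constants and reference measure in the paper may differ. The names `soft_min`, `robust_value` and `robust_target`, the grid search over $\lambda$, and the sampled next-state values are all assumptions made for illustration.

```python
import numpy as np


def soft_min(values, temperature):
    """Stable evaluation of -temperature * log(mean(exp(-values / temperature)))."""
    z = -np.asarray(values, dtype=float) / temperature
    z_max = z.max()  # shift by the maximum to avoid overflow in exp
    return -temperature * (z_max + np.log(np.mean(np.exp(z - z_max))))


def robust_value(next_values, eps, delta, lambdas=None):
    """Grid-search approximation of the assumed dual (requires delta > 0).

    next_values: samples of max_a Q(x', a) with x' drawn from a reference
    measure (hypothetical input, not the paper's sampling scheme).
    """
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 200)  # hypothetical grid of dual variables
    return max(-lam * eps + soft_min(next_values, lam * delta) for lam in lambdas)


def robust_target(reward, next_values, gamma, eps, delta):
    """Robust analogue of the DQN target r + gamma * E[max_a Q(x', a)]."""
    return reward + gamma * robust_value(next_values, eps, delta)


rng = np.random.default_rng(0)
next_values = rng.normal(loc=1.0, scale=0.5, size=1000)

nominal = 0.1 + 0.99 * next_values.mean()
robust = robust_target(0.1, next_values, gamma=0.99, eps=0.05, delta=0.1)
print(f"nominal target: {nominal:.4f}, robust target: {robust:.4f}")
```

In this sketch the robust target never exceeds the nominal one and decreases as `eps` grows, which matches the intuition of optimising against the worst-case transition within a ball around the reference measure. A practical implementation would presumably optimise the dual variable by gradient methods rather than a fixed grid.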

Paper Structure

This paper contains 34 sections, 15 theorems, 106 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Proposition 2.7

Let $\varepsilon > 0, \delta \geq 0$ and assume that Assumptions asu_1, asu_3 and asu_2 hold, then for the operator $\mathcal{T}_\delta$ being defined as

Figures (4)

  • Figure 1: The worst case cumulative distribution function for different values of $\delta$ with $\nu=$ Uniform$(0,1)$
  • Figure 2: The worst case cumulative distribution function for different values of $\delta$ with $\nu=$ Beta$(1,5)$
  • Figure 3: The expected reward for each action $a_{t+1}$ based on the current state $x_t$ and action $a_t$ under $\mathbb{P}(x_t,a_t)$. The x-axis is the state $x$ and the y-axis is the expected reward for the action taken.
  • Figure 4: RDQN agents trained and evaluated with varied transaction cost on the S&P 500 index from 3 Jan 1995 to 28 Dec 2023. Both agents use $\varepsilon=0.0025,\delta=0.0001$ with the same seed. Within each figure, the top graph shows the wealth trajectory of the RDQN agent (blue line) and the S&P 500 index (orange line). The middle graph shows the positions taken and the bottom graph shows the maximum drawdown of the RDQN agent (blue line) and the S&P 500 index (orange line).

Theorems & Definitions (35)

  • Definition 2.1: Wasserstein-distance
  • Definition 2.2: Sinkhorn distance
  • Remark 2.3: On the role of $\nu$
  • Proposition 2.7: Robust Bellman equation for the Sinkhorn ball ambiguity set
  • proof
  • Proposition 3.1
  • proof
  • Corollary 3.2
  • proof
  • Proposition 3.3: Universal approximation theorem
  • ...and 25 more