Stochastic Primal-Dual Q-Learning

Narim Jeong, Donghwan Lee, Niao He

TL;DR

This work introduces SPD Q-learning, a model-free, off-policy reinforcement learning algorithm built on a novel linear programming formulation of dynamic programming and a primal-dual perspective. By integrating a Q-function estimation step into the primal-dual LP framework, the method enables policy recovery from both primal and dual solutions and provides convergence guarantees under time-varying state-action distributions. The authors derive explicit sample-complexity bounds and demonstrate empirically that the primal SPD-Q policy can converge faster than its dual counterpart, while maintaining competitive off-policy learning performance. The approach offers a principled pathway to off-policy RL with convergence guarantees and broad potential extensions to safe, distributed, and function-approximation settings.

Abstract

In this work, we present a new model-free, off-policy reinforcement learning (RL) algorithm that is capable of finding a near-optimal policy from state-action observations generated by arbitrary behavior policies. Our algorithm, called stochastic primal-dual Q-learning (SPD Q-learning), hinges upon a new linear programming formulation and a dual perspective of standard Q-learning. In contrast to previous primal-dual RL algorithms, SPD Q-learning includes a Q-function estimation step, which allows an approximate policy to be recovered from the primal solution as well as the dual solution. We prove a first-of-its-kind result that SPD Q-learning guarantees a certain convergence rate, even when the state-action distribution is time-varying but converges sub-linearly to a stationary distribution. Numerical experiments are provided to demonstrate the off-policy learning abilities of the proposed algorithm in comparison to standard Q-learning.

Paper Structure

This paper contains 18 sections, 22 theorems, 124 equations, 4 figures, 1 algorithm.

Key Result

Lemma 1

The LP (eq:DP-LP-form) has the unique solution $V^*= (I_{|{\cal S}|}-\alpha P_{\pi^*})^{-1} R_{\pi^*}$.
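
As a rough numerical illustration of Lemma 1, the sketch below checks the closed form $V^*= (I-\alpha P_{\pi^*})^{-1} R_{\pi^*}$ on a toy problem. Everything in it is an assumption made for illustration: the 2-state, 2-action MDP, its numbers, the helper names, and the constraint form $V \ge R_a + \alpha P_a V$, which is the textbook LP for discounted dynamic programming and may differ in detail from the paper's LP. Value iteration stands in for an LP solver here.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
ALPHA = 0.9  # discount factor (alpha in Lemma 1)

# P[a][s][t]: probability of moving from state s to state t under action a.
P = [
    [[0.8, 0.2], [0.3, 0.7]],  # action 0
    [[0.1, 0.9], [0.6, 0.4]],  # action 1
]
# R[a][s]: expected reward for taking action a in state s.
R = [
    [1.0, 0.0],  # action 0
    [0.5, 2.0],  # action 1
]


def backup(V, s, a):
    """One-step Bellman backup R_a(s) + alpha * sum_t P_a(s, t) V(t)."""
    return R[a][s] + ALPHA * sum(P[a][s][t] * V[t] for t in range(2))


def value_iteration(tol=1e-12):
    """Approximate V* by iterating the Bellman optimality operator."""
    V = [0.0, 0.0]
    while True:
        V_new = [max(backup(V, s, a) for a in range(2)) for s in range(2)]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def closed_form(policy):
    """Solve (I - alpha * P_pi) V = R_pi for two states with Cramer's rule."""
    a00 = 1 - ALPHA * P[policy[0]][0][0]
    a01 = -ALPHA * P[policy[0]][0][1]
    a10 = -ALPHA * P[policy[1]][1][0]
    a11 = 1 - ALPHA * P[policy[1]][1][1]
    b0, b1 = R[policy[0]][0], R[policy[1]][1]
    det = a00 * a11 - a01 * a10
    return [(b0 * a11 - a01 * b1) / det, (a00 * b1 - a10 * b0) / det]


V_star = value_iteration()
# Greedy policy with respect to V*, playing the role of pi* in Lemma 1.
pi_star = [max(range(2), key=lambda a: backup(V_star, s, a)) for s in range(2)]
V_closed = closed_form(pi_star)
# Assumed LP constraints: V(s) >= R_a(s) + alpha * (P_a V)(s) for all s, a.
lp_feasible = all(
    V_star[s] >= backup(V_star, s, a) - 1e-9
    for s in range(2)
    for a in range(2)
)
```

On this toy problem, `V_star` and `V_closed` should agree up to numerical tolerance, and `lp_feasible` should hold, which is consistent with the lemma's statement that the LP solution coincides with the value of the optimal policy.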

Figures (4)

  • Figure 1: Evolution of the Q-function error, $\sum_{a\in {\cal A}}{{\|Q_a^*-{\hat{Q}}_{a,T}\|}_\infty}$ for the SPD Q-learning.
  • Figure 2: Evolution of the dual policy error, $\sum_{s\in {\cal S}} {\|\pi_s^*-\hat{\pi}_{s,T}^d\|_\infty}$, from the SPD Q-learning and the dual policy error, $\sum_{s\in {\cal S}} {\|\pi_s^*-\tilde{\pi}_{s,T}\|_\infty}$, from SPD-RL (chen2016stochastic) with importance sampling.
  • Figure 3: Evolution of the primal policy error, $\sum_{s\in {\cal S}}{\|\pi_s^*-\tilde{\pi}_{s,T}^p\|_\infty}$ (left-hand side), from the SPD Q-learning and the error of the standard Q-learning (right-hand side).
  • Figure 4: Evolution of the average reward corresponding to the primal policy of the SPD Q-learning (blue line) and the average reward of the standard Q-learning (green line).
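
The error metrics named in the captions of Figures 1 to 3 are sums of infinity norms, taken over actions for the Q-function error and over states for the policy errors. A minimal sketch of how they could be computed is given below. The function names, the nested-list layout (`Q[a][s]` and `pi[s][a]`), and the sample numbers are assumptions for illustration and are not taken from the paper.

```python
def sum_of_inf_norms(rows_true, rows_est):
    """Sum over rows of the infinity norm of (true row - estimated row)."""
    return sum(
        max(abs(x - y) for x, y in zip(row_true, row_est))
        for row_true, row_est in zip(rows_true, rows_est)
    )


def q_function_error(Q_star, Q_hat):
    """sum_a ||Q*_a - Q_hat_a||_inf, with Q[a][s] indexed by action, then state."""
    return sum_of_inf_norms(Q_star, Q_hat)


def policy_error(pi_star, pi_hat):
    """sum_s ||pi*_s - pi_hat_s||_inf, with pi[s][a] indexed by state, then action."""
    return sum_of_inf_norms(pi_star, pi_hat)


# Illustrative values only (two actions, two states).
Q_star = [[1.0, 2.0], [0.5, 3.0]]
Q_hat = [[1.2, 2.0], [0.5, 2.5]]
pi_star = [[1.0, 0.0], [0.0, 1.0]]
pi_hat = [[0.9, 0.1], [0.25, 0.75]]

q_err = q_function_error(Q_star, Q_hat)   # 0.2 from action 0 plus 0.5 from action 1
pi_err = policy_error(pi_star, pi_hat)    # 0.1 from state 0 plus 0.25 from state 1
```

Both metrics reduce to the same helper because each is a row-wise infinity norm followed by a sum; only the meaning of the row index changes.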

Theorems & Definitions (29)

  • Lemma 1: chen2016stochastic
  • Lemma 2: chen2016stochastic
  • Corollary 1
  • Lemma 3
  • Definition 1: Saddle point bertsekas2003convex
  • Proposition 1: bertsekas2003convex
  • Definition 2: Saddle point problem
  • Lemma 4
  • Lemma 5
  • Proposition 2
  • ...and 19 more