Regularized Q-Learning with Linear Function Approximation

Jiachen Xi; Alfredo Garcia; Petar Momcilovic

TL;DR

We study off-policy Q-learning for regularized MDPs with linear function approximation, where standard projected Bellman operators are not contractive. We introduce a differentiable smooth truncation operator and a bi-level optimization framework that decouples projection from Bellman updates, enabling a single-loop two-timescale algorithm with finite-time convergence guarantees under Markovian noise. The algorithm achieves a convergence rate of $\mathcal{O}(T^{-1/4})$ in gradient norms and provides explicit performance bounds for learned policies; these bounds account for approximation error and truncation bias, and they improve as the truncation bias diminishes. Empirical results on GridWorld and MountainCar-v0 illustrate improved MSPBE convergence and policy performance compared to baseline methods, and they offer practical guidance on choosing the truncation threshold. The work also clarifies the trade-offs that smooth truncation introduces in policy optimization.
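This summary does not reproduce the formula of the smooth truncation operator, so the sketch below uses a stand-in, `delta * tanh(x / delta)`. It is differentiable everywhere, bounded in magnitude by the threshold `delta`, and close to the identity for small inputs. The tanh form and the function names are assumptions made for illustration and may differ from the operator defined in the paper.

```python
import numpy as np

def smooth_truncate(x, delta):
    """Hypothetical smooth clipping of x into [-delta, delta].

    Uses delta * tanh(x / delta); this is an illustrative choice,
    not necessarily the operator used in the paper.
    """
    x = np.asarray(x, dtype=float)
    return delta * np.tanh(x / delta)

def smooth_truncate_grad(x, delta):
    """Derivative of smooth_truncate with respect to x: 1 - tanh(x / delta)^2."""
    x = np.asarray(x, dtype=float)
    return 1.0 - np.tanh(x / delta) ** 2
```

With a stand-in of this kind, a larger `delta` leaves more values essentially unchanged (less truncation bias), while a smaller `delta` gives a tighter bound on the truncated values. This is one way to picture the trade-off around the threshold that the summary refers to.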

Abstract

Regularized Markov Decision Processes (MDPs) model sequential decision making under uncertainty in which the decision maker has limited information-processing capacity and/or is averse to model ambiguity. With function approximation, the convergence properties of learning algorithms for regularized MDPs (e.g., soft Q-learning) are not well understood, because the composition of the regularized Bellman operator with a projection onto the span of basis vectors is not a contraction with respect to any norm. In this paper, we consider a bi-level optimization formulation of regularized Q-learning with linear function approximation. The lower-level optimization problem aims to identify a value function approximation that satisfies Bellman's recursive optimality condition, and the upper-level problem aims to find the projection onto the span of basis vectors. This formulation motivates a single-loop algorithm with finite-time convergence guarantees. The algorithm operates on two time-scales: updates to the projection of state-action values are "slow" in that they use a step size smaller than the one used for the "faster" updates of approximate solutions to Bellman's recursive optimality equation. We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise. In addition, we provide a performance guarantee for the policies derived from the proposed algorithm.
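To make the regularized Bellman operator more concrete, the following minimal sketch assumes entropy regularization (the soft Q-learning case mentioned in the abstract), where the maximum over actions is replaced by $\tau \log \sum_a \exp(Q(s,a)/\tau)$. The array shapes, the function names, and the tabular transition tensor `P` are hypothetical choices for illustration. This is not the paper's algorithm, only a single application of a soft Bellman backup to a linearly parameterized $Q = \Phi\theta$.

```python
import numpy as np

def soft_value(q, tau):
    """Entropy-regularized maximum over the last axis: tau * log(sum(exp(q / tau)))."""
    q = np.asarray(q, dtype=float)
    m = q.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    lse = m + tau * np.log(np.exp((q - m) / tau).sum(axis=-1, keepdims=True))
    return lse.squeeze(-1)

def soft_policy(q, tau):
    """Softmax policy over the last axis that attains the regularized maximum."""
    q = np.asarray(q, dtype=float)
    z = np.exp((q - q.max(axis=-1, keepdims=True)) / tau)
    return z / z.sum(axis=-1, keepdims=True)

def soft_bellman_backup(theta, phi, rewards, P, gamma, tau):
    """Apply a soft Bellman backup to Q = phi @ theta.

    Assumed shapes: phi (S, A, d), theta (d,), rewards (S, A), P (S, A, S).
    Returns an (S, A) array that generally lies outside the span of phi,
    which is why a projection step is needed with function approximation.
    """
    q = phi @ theta              # (S, A) linear state-action values
    v = soft_value(q, tau)       # (S,) regularized state values
    return rewards + gamma * (P @ v)
```

For any `tau > 0`, `soft_value` lies between `max(q)` and `max(q) + tau * log(A)`, so it approaches the hard maximum as `tau` shrinks.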

Paper Structure (18 sections, 8 theorems, 39 equations, 2 figures, 1 table, 1 algorithm)

Introduction
Our Contributions
Related Work
Preliminaries
Regularized Markov Decision Process
Linear Function Approximation
Problem Formulation
Smooth Truncation Operator
Bi-level Formulation
The Proposed Algorithm
Finite-Time Guarantees
Convergence Analysis
Performance Analysis
A Discussion on Threshold $\delta$
Numerical Illustration
...and 3 more sections

Key Result

Proposition II.1

Let $G$ be a strongly convex function bounded by $B >0$ and $\tau >0$ be a coefficient associated with $G$. The following hold:

Figures (2)

Figure 1: Rewards and Values Maps.
Figure 2: MSPBE of the estimated state-action value functions. The graph shows the average MSPBE ($\pm$ standard deviation) over $100$ runs.

Theorems & Definitions (11)

Proposition II.1
Proposition III.3
Lemma V.1
Lemma V.2
Theorem V.3
Remark V.4
Remark V.5
Corollary V.6
Theorem V.7
Lemma A1
...and 1 more
