Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence

Shengbo Wang

TL;DR

The paper proposes Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The induced Q-function converges almost surely in sup norm to the fixed point of a kernel-smoothed Bellman operator, and the approximation error between this limit and the optimal $Q^*$ is bounded as a function of the kernel bandwidth.

Abstract

We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose the novel Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth. To assess the performance of our proposed algorithm, we conduct RL experiments in a two-item inventory control setting.
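The weight-based idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the class name `QMeasureSketch`, the unnormalized Gaussian kernel, the ratio form of the estimate, the shared $1/n$ step size, and the finite action set are all assumptions made for illustration. The sketch keeps one occupation weight and one signed weight per visited state-action pair, so memory and per-step work both grow linearly in the number of samples.

```python
import numpy as np


def gaussian_kernel(z, centers, bandwidth):
    """Unnormalized Gaussian kernel values between z and each row of centers."""
    sq_dist = np.sum((centers - z) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


class QMeasureSketch:
    """Illustrative weight-based learner (hypothetical, not the paper's code)."""

    def __init__(self, actions, gamma=0.9, bandwidth=0.5):
        # Assumption: a finite candidate action set, so the max is a simple loop.
        self.actions = [np.atleast_1d(np.asarray(a, dtype=float)) for a in actions]
        self.gamma = gamma
        self.bandwidth = bandwidth
        self.points = []        # visited (state, action) pairs, one row each
        self.mu = np.zeros(0)   # weights of the empirical occupation measure
        self.nu = np.zeros(0)   # signed weights of the Q-measure estimate

    def _pair(self, state, action):
        return np.concatenate(
            [np.atleast_1d(state), np.atleast_1d(action)]
        ).astype(float)

    def q(self, state, action):
        """Kernel-integrated action-value estimate; O(n) in stored points."""
        if not self.points:
            return 0.0
        k = gaussian_kernel(
            self._pair(state, action), np.asarray(self.points), self.bandwidth
        )
        denom = k @ self.mu
        return float(k @ self.nu / denom) if denom > 1e-12 else 0.0

    def update(self, state, action, reward, next_state):
        """One coupled stochastic-approximation step on both weight vectors."""
        step = 1.0 / (len(self.points) + 1)  # assumed step-size schedule
        target = reward + self.gamma * max(
            self.q(next_state, a) for a in self.actions
        )
        self.points.append(self._pair(state, action))
        # Shrink old weights and place new mass on the visited pair.
        self.mu = np.append((1.0 - step) * self.mu, step)
        self.nu = np.append((1.0 - step) * self.nu, step * target)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    learner = QMeasureSketch(actions=[0, 1], gamma=0.9, bandwidth=0.5)
    s = rng.uniform(size=1)
    for _ in range(50):
        a = int(rng.integers(2))
        s_next = rng.uniform(size=1)
        learner.update(s, a, reward=float(-abs(s[0] - 0.5)), next_state=s_next)
        s = s_next
    print(learner.q(np.array([0.5]), 0))
```

Dividing the signed-measure integral by the occupation-measure integral is one plausible way to normalize the estimate; the paper's exact normalization and step sizes may differ.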

Paper Structure (22 sections, 14 theorems, 151 equations, 5 figures, 1 algorithm)

Introduction
Related work
Notations and Assumptions
Q-Measure
Algorithm and Efficient Weight-Based Implementation
Q-Measure-Learning
Weight Representation
Convergence
Normalized Gaussian-Kernel Metric
Almost sure convergence
Approximating $Q^*$ with $q^*$
Numerical Experiment
Proof of Proposition \ref{prop:D_metric}
Proof of Proposition \ref{prop:vanishing_err}
The bias term $S^B_{n,m}$:
...and 7 more sections

Key Result

Lemma 1

For any probability measure $\mu$ on $\mathcal{Z}$ and any $f,g\in C(\mathbb{Z})$, [...]. Therefore, the smoothed and clipped Bellman operator $\overline{\mathcal{T}}_\mu := \mathcal{K}_{\mu}\circ\overline{\mathcal{T}}:C(\mathbb{Z})\rightarrow C(\mathbb{Z})$ is a $\gamma$-contraction in $\left\|\cdot\right\|$. Hence, it admits a unique fixed point $q_\mu^*\in C(\mathbb{Z})$ such that [...]
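The contraction claim in Lemma 1 is consistent with the standard composition argument, sketched here under two assumptions that this excerpt does not state explicitly: that the inequality missing from the statement above says $\mathcal{K}_\mu$ is nonexpansive in the sup norm, and that the clipped Bellman operator $\overline{\mathcal{T}}$ is itself a $\gamma$-contraction.

```latex
\begin{aligned}
\left\|\overline{\mathcal{T}}_\mu f - \overline{\mathcal{T}}_\mu g\right\|
  &= \left\|\mathcal{K}_\mu\big(\overline{\mathcal{T}} f\big) - \mathcal{K}_\mu\big(\overline{\mathcal{T}} g\big)\right\| \\
  &\le \left\|\overline{\mathcal{T}} f - \overline{\mathcal{T}} g\right\|
     && \text{(assumed: $\mathcal{K}_\mu$ is nonexpansive)} \\
  &\le \gamma \left\|f - g\right\|
     && \text{(assumed: $\overline{\mathcal{T}}$ is a $\gamma$-contraction)}
\end{aligned}
```

Given this bound, existence and uniqueness of the fixed point $q_\mu^*$ would follow from the Banach fixed-point theorem, since $C(\mathbb{Z})$ with the sup norm is complete when the underlying space is compact (also an assumption here).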

Figures (5)

Figure 1: Left: estimated discounted return of the greedy policy induced by $q_n$. Right: $Q^*$ estimation error, where the curves exhibit stabilization/decrease consistent with convergence.
Figure 2: Greedy policy induced by $q_n$ compared with the DP policy on a grid.
Figure 3: Visitation count under the behavior policy: the entire state space is well-covered.
Figure 4: Q-Measure estimate $q_n(\cdot,a)$ versus the DP benchmark, $a = (0,2)$ and $(0,5)$.
Figure 5: Partial visitation of inventory states: the behavior policy explores only the bottom-left corner of the state space. The Q-measure estimates of $Q^*$ are accurate in the bottom-left region, but are off in the top-right region.

Theorems & Definitions (28)

Lemma 1
proof
Definition 1: The Q-Measure
Definition 2
Proposition 4.1: $D_{\mu}$ is a metric
Remark 1
Proposition 4.2: Convergence of the empirical process
proof
Theorem 1: Convergence of Algorithm 1
Lemma 2
...and 18 more
