
Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence

Shengbo Wang

TL;DR

We propose Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. We prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator, and bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth.

Abstract

We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose the novel Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth. To assess the performance of our proposed algorithm, we conduct RL experiments in a two-item inventory control setting.
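As a rough illustration of the "kernel integration" step described in the abstract, the sketch below evaluates an action-value estimate of the form $q(z) = \sum_i w_i\, k(z, z_i)$, where the $z_i$ are visited state-action pairs and the $w_i$ are signed weights. The Gaussian kernel, the absence of any normalization, and the names `gaussian_kernel`, `q_from_measure`, `support`, and `weights` are assumptions made for this example; the paper's actual update rule and kernel are not reproduced here.

```python
import math

def gaussian_kernel(z, z_i, bandwidth):
    """Gaussian kernel between two points given as equal-length tuples (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, z_i))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def q_from_measure(z, support, weights, bandwidth):
    """Integrate the kernel against a signed empirical measure: sum_i w_i * k(z, z_i)."""
    return sum(
        w * gaussian_kernel(z, z_i, bandwidth)
        for z_i, w in zip(support, weights)
    )

# Hypothetical visited (state, action) pairs and signed weights, for illustration only.
support = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
weights = [0.5, -0.2, 0.3]

q_value = q_from_measure((0.0, 0.0), support, weights, bandwidth=1.0)
```

Storing one weight per visited pair and summing over all of them at evaluation time is consistent with the $O(n)$ memory and $O(n)$ per-iteration cost stated in the abstract, since only the weights and support points are kept rather than a function-valued estimate.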

Paper Structure

This paper contains 22 sections, 14 theorems, 151 equations, 5 figures, 1 algorithm.

Key Result

Lemma 1

For any probability measure $\mu$ on $\mathcal{Z}$ and any $f,g\in C(\mathbb{Z})$, [...]. Therefore, the smoothed and clipped Bellman operator $\overline{\mathcal{T}}_\mu := \mathcal{K}_{\mu}\circ\overline{\mathcal{T}}:C(\mathbb{Z})\rightarrow C(\mathbb{Z})$ is a $\gamma$-contraction in $\left\|\cdot\right\|$. Hence, it admits a unique fixed point $q_\mu^*\in C(\mathbb{Z})$ such that [...]
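
A standard consequence of the $\gamma$-contraction property, stated here as a worked step rather than as a result quoted from the paper: iterating the operator from any starting point $q_0\in C(\mathbb{Z})$ (an arbitrary initial function, introduced only for this illustration) converges geometrically to the fixed point. Since $\overline{\mathcal{T}}_\mu q_\mu^* = q_\mu^*$, applying the contraction bound $n$ times gives

$$
\left\|\overline{\mathcal{T}}_\mu^{\,n} q_0 - q_\mu^*\right\|
= \left\|\overline{\mathcal{T}}_\mu^{\,n} q_0 - \overline{\mathcal{T}}_\mu^{\,n} q_\mu^*\right\|
\le \gamma^{n}\left\|q_0 - q_\mu^*\right\|.
$$

Uniqueness follows from the same inequality: if $q$ and $q'$ are both fixed points, then $\|q - q'\| \le \gamma\|q - q'\|$ with $\gamma < 1$, which forces $q = q'$.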

Figures (5)

  • Figure 1: Left: estimated discounted return of the greedy policy induced by $q_n$. Right: $Q^*$ estimation error, where the curves exhibit stabilization/decrease consistent with convergence.
  • Figure 2: Greedy policy induced by $q_n$ compared with the DP policy on a grid.
  • Figure 3: Visitation count under the behavior policy: the entire state space is well-covered.
  • Figure 4: Q-Measure estimate $q_n(\cdot,a)$ versus the DP benchmark, for actions $a = (0,2)$ and $a = (0,5)$.
  • Figure 5: Partial visitation of inventory states: the behavior policy explores only the bottom-left corner of the state space. The Q-measure estimates of $Q^*$ are accurate in the bottom-left region, but are off in the top-right region.

Theorems & Definitions (28)

  • Lemma 1
  • proof
  • Definition 1: The Q-Measure
  • Definition 2
  • Proposition 4.1: $D_{\mu}$ is a metric
  • Remark 1
  • Proposition 4.2: Convergence of the empirical process
  • proof
  • Theorem 1: Convergence of Algorithm 1
  • Lemma 2
  • ...and 18 more