Online Statistical Inference of Constant Sample-averaged Q-Learning

Saunak Kumar Panda, Tong Li, Ruiqi Liu, Yisha Xiang

Abstract

Reinforcement learning algorithms have been widely used for decision-making tasks in various domains. However, their performance can be degraded by high variance and instability, particularly in environments with noisy or sparse rewards. In this paper, we propose a framework for online statistical inference for a sample-averaged Q-learning approach. We establish a functional central limit theorem (FCLT) for the modified algorithm under general conditions and then construct confidence intervals for the Q-values via random scaling. We conduct experiments performing inference on both the modified approach and its traditional counterpart, standard Q-learning, using random scaling, and report their coverage rates and confidence interval widths on two problems: a grid world problem as a simple toy example and a dynamic resource-matching problem as a real-world example, to compare the two solution approaches.
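To make the random-scaling idea concrete, here is a minimal hedged sketch (not the paper's exact algorithm): constant-step-size Q-learning on a toy single-state, single-action MDP with mean reward 1 and discount 0.9, so the true Q-value is $1/(1-0.9)=10$. The averaged iterate is tracked online, and a random-scaling confidence interval is formed from the running variance of the averages. The critical value 6.747 is the standard 95% quantile of the pivotal random-scaling statistic; all variable names and the toy MDP are illustrative assumptions.

```python
import numpy as np

# Hedged sketch, not the paper's algorithm: constant-step Q-learning on a toy
# 1-state, 1-action MDP (mean reward 1, discount 0.9, true Q-value = 10),
# with a random-scaling confidence interval for the averaged iterate.
rng = np.random.default_rng(0)
gamma, eta, T = 0.9, 0.1, 20_000
q = 0.0                        # current Q-learning iterate
qbar = 0.0                     # online average of the iterates
s2 = s2q = s2q2 = 0.0          # running sums for the random-scaling variance
for t in range(1, T + 1):
    r = 1.0 + rng.normal()                 # noisy reward with mean 1
    q += eta * (r + gamma * q - q)         # constant-step Q-learning update
    qbar += (q - qbar) / t                 # running sample average
    s2 += t * t                            # sum of s^2
    s2q += t * t * qbar                    # sum of s^2 * qbar_s
    s2q2 += t * t * qbar * qbar            # sum of s^2 * qbar_s^2
# V_T = T^{-2} * sum_s s^2 (qbar_s - qbar_T)^2, expanded for online updating
V = (s2q2 - 2.0 * qbar * s2q + qbar * qbar * s2) / (T * T)
half = 6.747 * np.sqrt(V / T)              # random-scaling CI half-width
lo, hi = qbar - half, qbar + half
print(f"Q estimate {qbar:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The key practical point is that the variance term $V_T$ requires only three running scalar sums, so the interval is computed fully online with no stored trajectory and no plug-in estimate of the asymptotic covariance.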

Paper Structure

This paper contains 11 sections, 3 theorems, 18 equations, 2 figures, and 2 tables.

Key Result

Theorem 1

Assume access to a generative model for each state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$. Let $\mathbf{Q} \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$ be the matrix form of the Q-function $Q(s, a)$. Then there exists a constant $\eta_0 > 0$ such that for any $\eta \in (0, \eta_0)$ …

Figures (2)

  • Figure 1: Grid World
  • Figure 2: Dynamic Resource-matching

Theorems & Definitions (3)

  • Theorem 1
  • Corollary 1
  • Theorem 2