Asymptotic Analysis of Sample-averaged Q-learning

Saunak Kumar Panda, Ruiqi Liu, Yisha Xiang

TL;DR

This work develops sample-averaged Q-learning (SA-QL), a generalized Q-learning framework with time-varying batch averaging, and uses a functional central limit theorem (FCLT) to establish asymptotic normality of the trajectory-averaged Q-values under mild conditions. By introducing a batch-scheduling parameter and a random-scaling online inference method, the paper constructs valid confidence intervals without variance estimation, even under Markovian data dependence. Theoretical results show an $O(N^{-1/2})$ convergence rate in the total number of samples used, $N$, and characterize how batch growth affects the asymptotic variance through a $\beta$-dependent limit. Empirical results on stochastic OpenAI Gym environments show that moderate batch growth improves sample efficiency and narrows confidence interval widths, while overly aggressive batching can hurt performance, highlighting a practical trade-off for uncertainty quantification in RL.

Abstract

Reinforcement learning (RL) has emerged as a key approach for training agents in complex and uncertain environments. Incorporating statistical inference in RL algorithms is essential for understanding and managing uncertainty in model performance. This paper introduces a generalized framework for time-varying batch-averaged Q-learning, termed sample-averaged Q-learning (SA-QL), which extends traditional single-sample Q-learning by aggregating samples of rewards and next states to better account for data variability and uncertainty. We leverage the functional central limit theorem (FCLT) to establish a novel framework that provides insights into the asymptotic normality of the sample-averaged algorithm under mild conditions. Additionally, we develop a random scaling method for interval estimation, enabling the construction of confidence intervals without requiring extra hyperparameters. Extensive numerical experiments across classic stochastic OpenAI Gym environments, including windy gridworld and slippery frozenlake, demonstrate how different batch scheduling strategies affect learning efficiency, coverage rates, and confidence interval widths. This work establishes a unified theoretical foundation for sample-averaged Q-learning, providing insights into effective batch scheduling and statistical inference for RL algorithms.
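The idea of averaging a batch of sampled rewards and next states inside each Q-learning update can be illustrated with a minimal sketch. It is not the authors' implementation: the batch schedule $B_t = \lceil B_{\text{init}}\, t^{\beta} \rceil$, the polynomial step size, the synchronous generative-model sampling, and the Gaussian reward noise are all assumptions made here for illustration, and the function names are hypothetical.

```python
import math
import numpy as np


def batch_size(t, b_init, beta):
    """Assumed schedule B_t = ceil(b_init * t**beta); beta = 0 keeps the batch constant."""
    return max(1, math.ceil(b_init * t ** beta))


def sa_q_learning(P, R, gamma, n_iters, b_init=1, beta=0.0, noise_sd=1.0, seed=0):
    """Synchronous sample-averaged Q-learning on a tabular MDP (illustrative sketch).

    P: (S, A, S) transition probabilities, R: (S, A) mean rewards.
    Returns the last iterate, the trajectory average of the iterates,
    and the total number of samples drawn per state-action pair.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    Q_bar = np.zeros((S, A))
    total_samples = 0
    for t in range(1, n_iters + 1):
        B = batch_size(t, b_init, beta)
        V = Q.max(axis=1)  # greedy value of every state under the current Q
        target = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                # Draw B next states and B noisy rewards, then average the targets.
                next_states = rng.choice(S, size=B, p=P[s, a])
                rewards = R[s, a] + noise_sd * rng.standard_normal(B)
                target[s, a] = np.mean(rewards + gamma * V[next_states])
        eta = 1.0 / t ** 0.7  # assumed step size, not taken from the paper
        Q = (1.0 - eta) * Q + eta * target
        Q_bar += (Q - Q_bar) / t  # running average of the iterates
        total_samples += B
    return Q, Q_bar, total_samples


def q_star(P, R, gamma, n_sweeps=2000):
    """Reference solution by value iteration, used only to check the sketch."""
    Q = np.zeros(R.shape)
    for _ in range(n_sweeps):
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q
```

With `beta=0` and `b_init=1` the sketch reduces to single-sample Q-learning with iterate averaging; larger `beta` spends more samples per update, which is the trade-off the batch-scheduling experiments examine.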

Paper Structure

This paper contains 27 sections, 111 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Variance Multiplier vs. Sample Size for Different Betas
  • Figure 2: RMSE vs. Number of Samples for different $B_{\text{init}}$ and $\beta$ in the Windy Gridworld environment
  • Figure 3: RMSE vs. Number of Samples for different $B_{\text{init}}$ and $\beta$ in the Slippery Frozenlake environment
  • Figure 4: Sampling distribution of $\kappa_\beta$ for $\beta=0$
  • Figure 5: Sampling distribution of $\kappa_\beta$ for $\beta=0.05$
  • ...and 4 more figures

Theorems & Definitions (15)

  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • ...and 5 more