Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning

Weidong Liu, Jiyuan Tu, Xi Chen, Yichen Zhang

TL;DR

The paper tackles robust online policy evaluation in reinforcement learning under dependent data, outliers, and heavy-tailed rewards. It introduces ROPE, a fully online second-order Newton-type method that uses a smoothed Huber loss to achieve robust estimation without step-size tuning, and it establishes a Bahadur-type representation and asymptotic normality with variance $H^{-1}\Sigma(H^{\top})^{-1}$. An online estimator of the long-run covariance matrix enables valid confidence intervals in a streaming setting. Empirical results on infinite-horizon MDPs, FrozenLake, and MIMIC-III demonstrate improved uncertainty quantification, narrower confidence intervals, and reduced computation compared to online baselines, highlighting practical robustness and efficiency of the approach.
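To illustrate why a Huber-type loss gives robustness, here is a minimal sketch of the standard (unsmoothed) Huber loss and its derivative. This is not the paper's smoothed variant, whose exact form is not given in this summary; the function names `huber_loss` and `huber_score` and the threshold argument `tau` are hypothetical choices for illustration.

```python
def huber_loss(x, tau):
    """Standard Huber loss: quadratic for |x| <= tau, linear beyond.

    Illustrative only; the paper uses a smoothed version of this loss.
    """
    ax = abs(x)
    if ax <= tau:
        return 0.5 * x * x
    return tau * ax - 0.5 * tau * tau


def huber_score(x, tau):
    """Derivative of the Huber loss: the residual clipped to [-tau, tau]."""
    return max(-tau, min(tau, x))


if __name__ == "__main__":
    # A moderate residual passes through unchanged; an outlier is capped at tau.
    for residual in (-0.5, 1.0, 100.0):
        print(residual, huber_loss(residual, 2.0), huber_score(residual, 2.0))
```

Because the score is bounded by `tau`, a single extreme reward can move a gradient- or Newton-type update by only a limited amount, which is the intuition behind robustness to outliers and heavy tails.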

Abstract

Reinforcement learning has emerged as one of the prominent topics attracting attention in modern statistical learning, with policy evaluation being a key component. Unlike the traditional machine learning literature on this topic, our work emphasizes statistical inference for the model parameters and value functions of reinforcement learning algorithms. While most existing analyses assume random rewards to follow standard distributions, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop a fully online robust policy evaluation procedure, and establish the Bahadur-type representation of our estimator. Furthermore, we develop an online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper connects robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to online policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments conducted in simulations and real-world reinforcement learning experiments.
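As a hedged sketch of how inference based on the asymptotic distribution typically proceeds, suppose the estimator is asymptotically normal at the usual $\sqrt{n}$ rate with the sandwich variance $H^{-1}\Sigma(H^{\top})^{-1}$ stated in the TL;DR. The plug-in estimates $\widehat{H}_n$ and $\widehat{\Sigma}_n$, the nominal level $1-q$, and the $\sqrt{n}$ normalization are assumptions made here for illustration, not notation taken from the paper. A confidence interval for the $j$-th coordinate would then take the form

$$
\widehat{\theta}_{n,j} \;\pm\; z_{1-q/2}\,\sqrt{\frac{\bigl[\widehat{H}_n^{-1}\,\widehat{\Sigma}_n\,(\widehat{H}_n^{\top})^{-1}\bigr]_{jj}}{n}},
$$

where $z_{1-q/2}$ is the standard normal quantile. In a streaming setting, both $\widehat{H}_n$ and $\widehat{\Sigma}_n$ would need to be updated online, which is the role of the online long-run covariance estimator mentioned in the TL;DR.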

Paper Structure (19 sections, 18 theorems, 176 equations, 8 figures, 1 algorithm)

This paper contains 19 sections, 18 theorems, 176 equations, 8 figures, and 1 algorithm.

Key Result

Theorem 1

Suppose that conditions cond:mix through cond:Bbound hold and the thresholding parameter is $\tau_i = C_{\tau} \max(1, i^{\beta_1}/(\log i)^{\beta_2})$, where $\beta_1\in[0,1)$, $\beta_2\geq 0$, and $C_{\tau}>0$. Assume $n_{0}$ is sufficiently large and the initial value $|\widehat{\boldsymbol{\theta}}_{0}-\boldsymbol{\theta}\ldots$ … Here $\delta$ is defined in the moment condition in cond:Bbound.

Figures (8)

  • Figure 1: Coverage probability (first column) and confidence interval width (second column) of $\mathrm{ROPE}$ for various $C$ and $\beta$. The noise distribution is standard normal (first row) or Student's $t_{2.25}$ (second row).
  • Figure 2: Coverage probability (first column), confidence interval width (second column), and computing time (third column) of $\mathrm{ROPE}$ and $\mathrm{LSA}$. The noise distribution is standard normal (first row) or Student's $t_{2.25}$ (second row).
  • Figure 3: Coverage probability (left) and confidence interval width (right) of $\mathrm{ROPE}$ and $\mathrm{LSA}$ for various $\alpha$ and $\eta$. The noise distribution is standard normal.
  • Figure 4: Coverage probability (first row) and confidence interval width (second row) of $\mathrm{ROPE}$ for various $C$ and $\beta$. The contamination rate is $0$ (first column), $n^{-1}$ (second column), and $0.05n^{-1/2}$ (third column).
  • Figure 5: Coverage probability (first row), confidence interval width (second row), and computing time (third row) of $\mathrm{ROPE}$ and $\mathrm{LSA}$. The contamination rate is $0$ (first column), $n^{-1}$ (second column), and $0.05n^{-1/2}$ (third column).
  • ...and 3 more figures

Theorems & Definitions (35)

  • Theorem 1
  • Corollary 2
  • Corollary 3
  • Theorem 4
  • Proposition 5
  • Remark 1: Acceleration of Convergence
  • Remark 2
  • Theorem 6
  • Corollary 7
  • Lemma 1
  • ...and 25 more