Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning

Weidong Liu; Jiyuan Tu; Xi Chen; Yichen Zhang

TL;DR

The paper tackles robust online policy evaluation in reinforcement learning under dependent data, outliers, and heavy-tailed rewards. It introduces ROPE, a fully online second-order Newton-type method that uses a smoothed Huber loss to achieve robust estimation without step-size tuning, and it establishes a Bahadur-type representation and asymptotic normality with variance $H^{-1}\Sigma(H^{\top})^{-1}$. An online estimator of the long-run covariance matrix enables valid confidence intervals in a streaming setting. Empirical results on infinite-horizon MDPs, FrozenLake, and MIMIC-III demonstrate improved uncertainty quantification, narrower confidence intervals, and reduced computation compared to online baselines, highlighting practical robustness and efficiency of the approach.
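As an illustration of how the sandwich variance $H^{-1}\Sigma(H^{\top})^{-1}$ turns into confidence intervals, the following is a minimal sketch, not the paper's online procedure. It assumes the usual $\sqrt{n}$ scaling of the estimator and that estimates of $H$ and $\Sigma$ are already available. The function name `sandwich_confidence_intervals` and its arguments are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def sandwich_confidence_intervals(theta_hat, H, Sigma, n, level=0.95):
    """Coordinate-wise intervals from the covariance H^{-1} Sigma (H^T)^{-1}.

    Assumption of this sketch: sqrt(n) * (theta_hat - theta) is approximately
    normal with that covariance, and H, Sigma are supplied by the caller.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    H = np.asarray(H, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)

    H_inv = np.linalg.inv(H)
    # (H^T)^{-1} equals (H^{-1})^T, so the sandwich is H_inv @ Sigma @ H_inv.T
    cov = H_inv @ Sigma @ H_inv.T

    # Standard error of each coordinate under the assumed sqrt(n) scaling
    se = np.sqrt(np.diag(cov) / n)

    # Two-sided normal quantile, e.g. about 1.96 for level = 0.95
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return theta_hat - z * se, theta_hat + z * se


if __name__ == "__main__":
    lower, upper = sandwich_confidence_intervals(
        theta_hat=[0.0, 1.0], H=np.eye(2), Sigma=4.0 * np.eye(2), n=100
    )
    print(lower, upper)
```

In a streaming setting the same formula would be re-evaluated as the running estimates of $H$ and $\Sigma$ are updated; how those estimates are maintained online is the paper's contribution and is not reproduced here.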

Abstract

Reinforcement learning has emerged as one of the prominent topics attracting attention in modern statistical learning, with policy evaluation being a key component. Unlike the traditional machine learning literature on this topic, our work emphasizes statistical inference for the model parameters and value functions of reinforcement learning algorithms. While most existing analyses assume random rewards to follow standard distributions, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop a fully online robust policy evaluation procedure, and establish the Bahadur-type representation of our estimator. Furthermore, we develop an online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper connects robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to online policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments conducted in simulations and real-world reinforcement learning experiments.

Paper Structure (19 sections, 18 theorems, 176 equations, 8 figures, 1 algorithm)

Introduction
Related Works
Paper Organization and Notations
Online Robust Policy Evaluation in Reinforcement Learning
Online Newton-type Method for Parameter Estimation
Convergence Rate of ROPE
Asymptotic Normality and Bahadur Representation
Estimation of Long-Run Covariance Matrix and Online Statistical Inference
Numerical Experiments
Parameter Inference for Infinite-Horizon MDP
Value Inference for FrozenLake RL Environment
Online Policy Evaluation on MIMIC-III Dataset
Concluding Remarks
Appendix
Experiment of the Effect of Thresholding parameters
...and 4 more sections

Key Result

Theorem 1

Suppose that conditions cond:mix through cond:Bbound hold and the thresholding parameter is $\tau_i = C_{\tau} \max(1, i^{\beta_1}/(\log i)^{\beta_2})$, where $\beta_1\in[0,1)$, $\beta_2\geq 0$, and $C_{\tau}>0$. Assume $n_{0}$ is sufficiently large and the initial value $|\widehat{\boldsymbol{\theta}}_{0}-\boldsymbol{\theta}\ldots$ [...] Here $\delta$ is defined in the moment condition in cond:Bbound.
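The thresholding schedule $\tau_i = C_{\tau}\max(1, i^{\beta_1}/(\log i)^{\beta_2})$ in Theorem 1 can be evaluated directly. The sketch below is illustrative only: the function name `threshold` is hypothetical, and restricting to $i \geq 2$ (so that $\log i > 0$) is an assumption of this sketch rather than something stated in the theorem.

```python
import math


def threshold(i, c_tau, beta1, beta2):
    """Evaluate tau_i = c_tau * max(1, i**beta1 / (log i)**beta2).

    Assumption of this sketch: i >= 2, so that log(i) is strictly positive.
    """
    if i < 2:
        raise ValueError("this sketch requires i >= 2 so that log(i) > 0")
    if not (0.0 <= beta1 < 1.0) or beta2 < 0.0 or c_tau <= 0.0:
        raise ValueError("need beta1 in [0, 1), beta2 >= 0, and c_tau > 0")
    return c_tau * max(1.0, i ** beta1 / math.log(i) ** beta2)


if __name__ == "__main__":
    # With beta1 = beta2 = 0 the threshold is the constant c_tau;
    # with beta1 > 0 it grows with the sample index i.
    for i in (2, 10, 100, 1000):
        print(i, threshold(i, 1.0, 0.0, 0.0), threshold(i, 1.0, 0.5, 1.0))
```

Setting $\beta_1 = 0$ and $\beta_2 = 0$ gives a fixed threshold, while $\beta_1 > 0$ lets the threshold grow with $i$.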

Figures (8)

Figure 1: Coverage probability (first column) and confidence-interval width (second column) of $\mathrm{ROPE}$ for various values of $C$ and $\beta$. The noise distribution is standard normal (first row) or Student's $t_{2.25}$ (second row).
Figure 2: Coverage probability (first column), confidence-interval width (second column), and computing time (third column) of $\mathrm{ROPE}$ and $\mathrm{LSA}$. The noise distribution is standard normal (first row) or Student's $t_{2.25}$ (second row).
Figure 3: Coverage probability (left) and confidence-interval width (right) of $\mathrm{ROPE}$ and $\mathrm{LSA}$ for various values of $\alpha$ and $\eta$. The noise distribution is standard normal.
Figure 4: Coverage probability (first row) and confidence-interval width (second row) of $\mathrm{ROPE}$ for various values of $C$ and $\beta$. The contamination rate is $0$ (first column), $n^{-1}$ (second column), and $0.05n^{-1/2}$ (third column).
Figure 5: Coverage probability (first row), confidence-interval width (second row), and computing time (third row) of $\mathrm{ROPE}$ and $\mathrm{LSA}$. The contamination rate is $0$ (first column), $n^{-1}$ (second column), and $0.05n^{-1/2}$ (third column).
...and 3 more figures

Theorems & Definitions (35)

Theorem 1
Corollary 2
Corollary 3
Theorem 4
Proposition 5
Remark 1: Acceleration of Convergence
Remark 2
Theorem 6
Corollary 7
Lemma 1
...and 25 more
