Post Reinforcement Learning Inference

Vasilis Syrgkanis; Ruohan Zhan

TL;DR

The paper addresses post-adaptive data inference for reinforcement learning by formulating the estimation of counterfactual policy values and dynamic treatment effects as semiparametric moment problems. It introduces Adaptively Weighted GMM (AW-GMM), where time-varying adaptive weights $H_i$ stabilize the nonstationary variance induced by evolving behavior policies, enabling consistent estimation and asymptotic normality under a homoscedastic residuals assumption. Key theoretical contributions include explicit $(\alpha_1,\alpha_2)$-regularizing and stabilizing weight schemes, a martingale-based central limit theorem with strong Gaussian approximation, and feasible weight constructions that approximate oracle performance in high-dimensional Markovian RL models. The framework is validated through high-dimensional simulations showing improved coverage and robustness to misspecification, and includes detailed proofs and auxiliary results for identification, consistency, Gaussian approximation, and estimation of nuisance components.

Abstract

We study estimation and inference using data collected by reinforcement learning (RL) algorithms. These algorithms adaptively experiment by interacting with individual units over multiple stages, updating their strategies based on past outcomes. Our goal is to evaluate a counterfactual policy after data collection and estimate structural parameters, such as dynamic treatment effects, that support credit assignment and quantify the impact of early actions on final outcomes. These parameters can often be defined as solutions to moment equations, motivating moment-based estimation methods developed for static data. In RL settings, however, data are often collected adaptively under nonstationary behavior policies. As a result, standard estimators fail to achieve asymptotic normality due to time-varying variance. We propose a weighted generalized method of moments (GMM) approach that uses adaptive weights to stabilize this variance. We characterize weighting schemes that ensure consistency and asymptotic normality of the weighted GMM estimators, enabling valid hypothesis testing and uniform confidence region construction. Key applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
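To make the idea of variance-stabilizing weights concrete, the following is a minimal sketch in a deliberately simplified setting, not the paper's AW-GMM construction. It assumes a single-stage, two-arm problem in which the probability `e` of pulling arm 1 decays over time (a deterministic schedule stands in for an adaptive behavior policy), uses inverse-propensity scores as the moment, and picks weights proportional to `sqrt(e)`. The function names, the decay schedule, and the weight choice are all assumptions made for illustration.

```python
import numpy as np

def weighted_moment_estimate(scores, weights):
    """Solve the weighted moment equation sum_i w_i * (score_i - theta) = 0."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * scores) / np.sum(weights))

def simulate(n=5000, seed=0):
    """Hypothetical data: arm-1 outcomes with mean 1.0, decaying exploration."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1)
    e = np.maximum(t ** -0.5, 0.05)   # assumed probability of pulling arm 1
    a = rng.random(n) < e             # indicator that arm 1 was pulled
    y = 1.0 + rng.normal(size=n)      # potential outcome under arm 1
    scores = a * y / e                # inverse-propensity score, mean 1.0
    return scores, e

scores, e = simulate()
# Uniform weights: the per-sample variance grows like 1/e as exploration decays.
theta_uniform = weighted_moment_estimate(scores, np.ones_like(e))
# Weights proportional to sqrt(e): the weighted variance w^2 / e stays roughly flat.
theta_adaptive = weighted_moment_estimate(scores, np.sqrt(e))
print(theta_uniform, theta_adaptive)
```

Both estimates target the same arm mean; the difference lies in how much each sample's variance contributes to the total, which is the quantity the weights are meant to keep stable over time.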

Paper Structure (88 sections, 38 theorems, 322 equations, 5 figures, 4 algorithms)

Introduction
Our Contributions
Setup
Episodic Potential Outcome
Inference Goal: Value of Dynamic Treatment Assignment Policy
Research Question
Adaptive Episodic RL Data.
Goal.
Notation.
Identification and Estimation
Identification via Moment Equations
Adaptively Weighted Generalized Method of Moments (AW-GMM)
Consistency of AW-GMM Estimation
Consistency Weights for Decaying Exploration under Bilinear Features
Asymptotic Normality of AW-GMM Estimation
...and 73 more sections

Key Result

Lemma 1

Under the sequential conditional exogeneity and linear blip function assumptions, the evaluation policy value is $\theta^*_0= \mathbb{E}\left[Y-\sum_{j=1}^L\Phi_j^\top \theta_j^*\right]$.
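As a rough illustration of the formula in Lemma 1, the sketch below computes the plain sample analogue of $\mathbb{E}\left[Y-\sum_{j=1}^L\Phi_j^\top \theta_j\right]$ given outcomes, per-stage feature matrices, and per-stage parameter vectors. It uses an unweighted average, so it does not include the adaptive weights that the paper's estimator relies on; the function name and argument layout are assumptions made for this example.

```python
import numpy as np

def policy_value_plugin(y, phis, thetas):
    """Sample analogue of theta_0 = E[Y - sum_j Phi_j^T theta_j].

    y      : array of shape (n,), observed final outcomes
    phis   : list of L arrays, each of shape (n, d_j), stage-j features
    thetas : list of L arrays, each of shape (d_j,), stage-j parameters
    """
    y = np.asarray(y, dtype=float)
    adjustment = np.zeros_like(y)
    for phi_j, theta_j in zip(phis, thetas):
        # Per-unit inner product Phi_j^T theta_j for stage j.
        adjustment += np.asarray(phi_j, dtype=float) @ np.asarray(theta_j, dtype=float)
    return float(np.mean(y - adjustment))

# Two-stage toy example with made-up numbers.
y = [1.0, 2.0, 3.0]
phis = [np.ones((3, 1)), np.array([[0.0], [1.0], [2.0]])]
thetas = [np.array([0.5]), np.array([1.0])]
print(policy_value_plugin(y, phis, thetas))
```

In the toy example each unit's adjusted outcome is 0.5, so the plug-in value is 0.5.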

Figures (5)

Figure 1: Causal graphs illustrating the sequential conditional exogeneity assumption for a two-period setting, showing both observed data and counterfactual data for each unit under an alternative policy $\pi$. Solid arrows represent the observed treatment assignments and state transitions in the collected data, while dotted arrows indicate counterfactual treatment assignments and state transitions under policy $\pi$.
Figure 2: Inference results of AW-GMM estimators with different weights across varying sample sizes. Error bars are $95\%$ confidence intervals derived from $10^3$ simulations. Results under Oracle weights are shown as dashed lines to indicate that oracle weighting requires knowledge of the ground-truth structural parameters and thus cannot be applied in practice. AW-GMM estimators with Oracle and Feasible weights meet nominal coverage, while those with Naive and Consistent weights either under- or over-cover.
Figure 3: Histogram of studentized statistics from the Gaussian approximation \ref{eq:strong_gaussian_approx} at sample size $n=5\times 10^3$. Numbers are aggregated from $10^3$ simulations. AW-GMM estimators with Oracle and Feasible weights are asymptotically normal.
Figure 4: Estimation results of AW-GMM estimators with different weights across varying sample sizes. Error bars are $95\%$ confidence intervals derived from $10^3$ simulations. Results under Oracle weights are shown as dashed lines to indicate that oracle weighting requires knowledge of the ground-truth structural parameters and thus cannot be applied in practice. AW-GMM estimators with Oracle and Feasible weights provide more accurate estimates of the policy value $\theta_0^*$.
Figure 5: Estimation and inference results of AW-GMM Estimations with different weights under mis-specification at sample size $n=5\times 10^3$. We use polynomial approximations from degrees 1 to 5 for exponential feature mappings. Error bars are $95\%$ confidence intervals derived from $10^3$ simulations. AW-GMM Estimations under Feasible weights achieve nominal coverage and tight confidence intervals for degrees above one, consistently offering more accurate estimations with lower MSE and bias across all approximation degrees.

Theorems & Definitions (48)

Remark 1
Definition 1: Blip function (Robins, 2004)
Remark 2
Lemma 1: Expected Outcome via Blip Functions
Lemma 2: Identification of Blip Functions
Remark 3
Theorem 1
Corollary 1
Remark 4
Example 1: Categorical Treatment
...and 38 more
