Post Reinforcement Learning Inference
Vasilis Syrgkanis, Ruohan Zhan
TL;DR
The paper addresses post-adaptive data inference for reinforcement learning by formulating the estimation of counterfactual policy values and dynamic treatment effects as semiparametric moment problems. It introduces Adaptively Weighted GMM (AW-GMM), where time-varying adaptive weights $H_i$ stabilize the nonstationary variance induced by evolving behavior policies, enabling consistent estimation and asymptotic normality under a homoscedastic residuals assumption. Key theoretical contributions include explicit $(\alpha_1,\alpha_2)$-regularizing and stabilizing weight schemes, a martingale-based central limit theorem with strong Gaussian approximation, and feasible weight constructions that approximate oracle performance in high-dimensional Markovian RL models. The framework is validated through high-dimensional simulations showing improved coverage and robustness to misspecification, and includes detailed proofs and auxiliary results for identification, consistency, Gaussian approximation, and estimation of nuisance components.
Abstract
We study estimation and inference using data collected by reinforcement learning (RL) algorithms. These algorithms adaptively experiment by interacting with individual units over multiple stages, updating their strategies based on past outcomes. Our goal is to evaluate a counterfactual policy after data collection and estimate structural parameters, such as dynamic treatment effects, that support credit assignment and quantify the impact of early actions on final outcomes. These parameters can often be defined as solutions to moment equations, motivating moment-based estimation methods developed for static data. In RL settings, however, data are often collected adaptively under nonstationary behavior policies. As a result, standard estimators fail to achieve asymptotic normality due to time-varying variance. We propose a weighted generalized method of moments (GMM) approach that uses adaptive weights to stabilize this variance. We characterize weighting schemes that ensure consistency and asymptotic normality of the weighted GMM estimators, enabling valid hypothesis testing and uniform confidence region construction. Key applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
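The variance-stabilization idea can be illustrated on a toy problem. The following is a minimal sketch, not the paper's AW-GMM estimator: it assumes a two-armed bandit with a decaying-exploration policy, uses the inverse-propensity score as a single moment, and takes $h_t = \sqrt{e_t}$ (the square root of the logged propensity) as one illustrative variance-stabilizing weight. The function names `simulate_bandit` and `weighted_moment_estimate`, the exploration schedule, and the reward means are all hypothetical choices made for this example.

```python
import numpy as np


def simulate_bandit(T, rng, true_means=(0.0, 0.5)):
    """Collect T rounds from a two-armed bandit with decaying exploration.

    Returns the actions, rewards, and the propensities the policy used,
    which depend only on data observed before each round.
    """
    actions = np.empty(T, dtype=int)
    rewards = np.empty(T)
    props = np.empty((T, 2))
    sums = np.zeros(2)
    counts = np.zeros(2)
    for t in range(T):
        # Exploration rate shrinks over time, so the policy is nonstationary.
        eps = max(0.05, 1.0 / np.sqrt(t + 1))
        means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        greedy = int(np.argmax(means))
        p = np.full(2, eps / 2)
        p[greedy] += 1.0 - eps
        a = int(rng.choice(2, p=p))
        y = true_means[a] + rng.normal()
        actions[t], rewards[t], props[t] = a, y, p
        sums[a] += y
        counts[a] += 1
    return actions, rewards, props


def weighted_moment_estimate(actions, rewards, props, arm, weights):
    """Solve sum_t h_t * (score_t - theta) = 0 for theta.

    score_t is the inverse-propensity score for `arm`; h_t must depend only
    on the history before round t so each term stays conditionally mean zero.
    """
    e = props[:, arm]
    score = (actions == arm) * rewards / e
    return np.sum(weights * score) / np.sum(weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    actions, rewards, props = simulate_bandit(5000, rng)
    arm = 1
    uniform = weighted_moment_estimate(actions, rewards, props, arm, np.ones(5000))
    # Illustrative choice: h_t = sqrt(e_t) roughly equalizes h_t^2 * Var(score_t).
    adaptive = weighted_moment_estimate(actions, rewards, props, arm, np.sqrt(props[:, arm]))
    print("uniform weights:", uniform)
    print("sqrt-propensity weights:", adaptive)
```

Because the conditional variance of the inverse-propensity score scales roughly like $1/e_t$, multiplying by $\sqrt{e_t}$ keeps each weighted term's variance approximately constant across rounds, which is the role the adaptive weights play in the abstract's description. With uniform weights the same function reduces to the ordinary inverse-propensity mean.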
