Asymptotic Theory for IV-Based Reinforcement Learning with Potential Endogeneity
Jin Li, Ye Luo, Zigan Wang, Xiaowei Zhang
TL;DR
The paper addresses endogeneity in sequential decision problems. It identifies a reinforcement bias (R-bias) that arises when data generation depends on past decisions and the observed rewards are biased. To correct it, the authors propose instrument variable (IV)-based reinforcement learning algorithms (IV-Q-learning and IV-AC) and embed them in a stochastic approximation (SA) framework that accommodates iterate-dependent Markovian dynamics and policy improvement. They establish asymptotic normality of the algorithms, derive central limit theorems (CLTs) for inference on optimal policies, and give practical guidance for estimating and testing these policies, including the use of minibatches. In simulations and an empirical application to corporate share repurchases, IV-RL reduces R-bias and improves policy inference relative to standard methods and inverse-probability-weighting (IPW)-based alternatives. The work advances causal inference in adaptive, online learning environments and offers a principled approach to policy evaluation under endogeneity in complex Markov decision processes (MDPs).
Abstract
In the standard data analysis framework, data is collected (once and for all), and then data analysis is carried out. However, with the advancement of digital technology, decision-makers constantly analyze past data and generate new data through their decisions. We model this as a Markov decision process and show that the dynamic interaction between data generation and data analysis leads to a new type of bias, reinforcement bias, that exacerbates the endogeneity problem in standard data analysis. We propose a class of instrument variable (IV)-based reinforcement learning (RL) algorithms to correct for the bias and establish their theoretical properties by incorporating them into a stochastic approximation (SA) framework. Our analysis accommodates iterate-dependent Markovian structures and, therefore, can be used to study RL algorithms with policy improvement. We also provide formulas for inference on optimal policies of the IV-RL algorithms. These formulas highlight how intertemporal dependencies of the Markovian environment affect the inference.
