Semiparametric Double Reinforcement Learning with Applications to Long-Term Causal Inference
Lars van der Laan, David Hubbard, Allen Tran, Nathan Kallus, Aurélien Bibaut
TL;DR
The paper develops a semiparametric framework for double reinforcement learning (DRL) to enable valid long-horizon causal inference under weaker overlap assumptions. By restricting the Q-function to a subspace and leveraging the Bellman equation as a linear inverse problem, it derives automatic, doubly robust estimators with improved efficiency, and introduces superefficient, calibration-based methods that avoid density-ratio estimation altogether. Central to the approach is Bellman calibration via fitted Q-iteration and isotonic regression, yielding estimators that are efficient for oracle dimension-reduced targets and superefficient for the original estimand, albeit with potential irregularity under local alternatives. Theoretical results provide pathwise differentiability, efficient influence functions, and conditions for nuisance-rate and empirical-process control, while numerical experiments demonstrate superior stability and coverage under limited intertemporal overlap. These contributions offer practical, scalable tools for long-term causal inference in dynamic settings where traditional nonparametric DRL faces instability and heavy computation.
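To make the phrase "Bellman equation as a linear inverse problem" concrete, the display below writes out the standard policy-evaluation Bellman equation and the linear operator it induces. The notation (r, γ, π, S', and the operator symbol) is assumed here for illustration and may differ from the paper's own.

```latex
% Assumed notation: r(s,a) = E[R | S=s, A=a], discount \gamma \in [0,1),
% target policy \pi, next state S'.
\begin{align*}
Q(s,a) &= r(s,a) + \gamma\, \mathbb{E}\Big[\textstyle\sum_{a'} \pi(a' \mid S')\, Q(S',a') \,\Big|\, S=s,\, A=a\Big], \\
(\mathcal{T} Q)(s,a) &:= Q(s,a) - \gamma\, \mathbb{E}\Big[\textstyle\sum_{a'} \pi(a' \mid S')\, Q(S',a') \,\Big|\, S=s,\, A=a\Big] = r(s,a).
\end{align*}
```

Because the operator is linear in Q, recovering Q from the reward function r amounts to inverting a linear operator, and restricting Q to a subspace restricts the domain over which that inversion is carried out.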
Abstract
Double Reinforcement Learning (DRL) enables efficient inference for policy values in nonparametric Markov decision processes (MDPs), but existing methods face two major obstacles: (1) they require stringent intertemporal overlap conditions on state trajectories, and (2) they rely on estimating high-dimensional occupancy density ratios. Motivated by problems in long-term causal inference, we extend DRL to a semiparametric setting and develop doubly robust, automatic estimators for general linear functionals of the Q-function in infinite-horizon, time-homogeneous MDPs. By imposing structure on the Q-function, we relax the overlap conditions required by nonparametric methods and obtain efficiency gains. The second obstacle, density-ratio estimation, typically requires computationally expensive and unstable min-max optimization. To address both challenges, we introduce superefficient nonparametric estimators whose limiting variance falls below the generalized Cramér-Rao bound. These estimators treat the Q-function as a one-dimensional summary of the state-action process, reducing high-dimensional overlap requirements to a one-dimensional condition. The procedure is simple to implement: estimate and calibrate the Q-function using fitted Q-iteration, then plug the result into the target functional, thereby avoiding density-ratio estimation altogether.
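The three-step procedure in the last sentence (estimate Q, calibrate it, plug it in) can be sketched on a toy tabular MDP. This is a minimal illustration under stated assumptions, not the authors' implementation: the simulated data, all function names, and the specific calibration scheme (iterating isotonic regression of the Bellman target on the initial Q estimate, with linear interpolation between knots) are hypothetical choices made for this example. Only NumPy is used.

```python
import numpy as np

def isotonic_knots(x, y):
    """Least-squares nondecreasing fit of y on x (pool-adjacent-violators).
    Returns the sorted unique x values and the fitted value at each one."""
    ux, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.bincount(inv, weights=y)
    blocks = []  # each block: [sum of y, number of points, number of x levels]
    for s_, c_ in zip(sums, counts):
        blocks.append([s_, c_, 1])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s2, c2, k2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] += k2
    fit = np.concatenate([np.full(k, s_ / c_) for s_, c_, k in blocks])
    return ux, fit

def fitted_q_iteration(s, a, r, s_next, pi, gamma, n_states, n_actions, n_iter):
    """Tabular fitted Q-iteration for evaluating policy pi (pi[s, a] = prob.)."""
    idx = s * n_actions + a
    counts = np.bincount(idx, minlength=n_states * n_actions)
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        target = r + gamma * (pi[s_next] * q[s_next]).sum(axis=1)
        sums = np.bincount(idx, weights=target, minlength=n_states * n_actions)
        q = (sums / np.maximum(counts, 1)).reshape(n_states, n_actions)
    return q

def calibrate_q(q_hat, s, a, r, s_next, pi, gamma, n_iter=200):
    """Assumed calibration scheme: repeatedly regress the Bellman target on
    q_hat(s, a) under a monotonicity constraint, so the result is a
    nondecreasing transform of q_hat."""
    q_cal = q_hat.copy()
    x = q_hat[s, a]
    for _ in range(n_iter):
        target = r + gamma * (pi[s_next] * q_cal[s_next]).sum(axis=1)
        ux, fit = isotonic_knots(x, target)
        # Extend the fitted transform to every table entry by interpolation.
        q_cal = np.interp(q_hat.ravel(), ux, fit).reshape(q_hat.shape)
    return q_cal

def plug_in_value(q, s0, pi):
    """Plug-in policy value: average of sum_a pi(a | s0) q(s0, a)."""
    return float((pi[s0] * q[s0]).sum(axis=1).mean())

# Simulated toy data (entirely hypothetical).
rng = np.random.default_rng(0)
n_states, n_actions, gamma, n = 4, 2, 0.8, 5000
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # (S, A, S)
R = rng.random((n_states, n_actions))        # mean reward in [0, 1]
s = rng.integers(n_states, size=n)
a = rng.integers(n_actions, size=n)          # uniform behavior policy
r = (rng.random(n) < R[s, a]).astype(float)  # Bernoulli rewards
cum = P[s, a].cumsum(axis=1)
s_next = np.minimum((rng.random(n)[:, None] > cum).sum(axis=1), n_states - 1)

pi = np.zeros((n_states, n_actions))
pi[:, 1] = 1.0                               # target policy: always action 1

# Step 1: a deliberately crude Q estimate (only a few FQI sweeps).
q_hat = fitted_q_iteration(s, a, r, s_next, pi, gamma, n_states, n_actions, 3)
# Step 2: calibrate it. Step 3: plug into the target functional.
q_cal = calibrate_q(q_hat, s, a, r, s_next, pi, gamma)
s0 = rng.integers(n_states, size=1000)       # draws of the initial state
value_cal = plug_in_value(q_cal, s0, pi)
print("plug-in value from calibrated Q:", value_cal)
```

Note that no occupancy density ratio appears anywhere in the sketch: the calibration step only regresses on the scalar q_hat(s, a), which is the sense in which the Q-function serves as a one-dimensional summary of the state-action process.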
