Semiparametric Double Reinforcement Learning with Applications to Long-Term Causal Inference

Lars van der Laan; David Hubbard; Allen Tran; Nathan Kallus; Aurélien Bibaut

TL;DR

The paper develops a semiparametric framework for double reinforcement learning (DRL) to enable valid long-horizon causal inference under weaker overlap assumptions. By restricting the Q-function to a subspace and leveraging the Bellman equation as a linear inverse problem, it derives automatic, doubly robust estimators with improved efficiency, and introduces superefficient, calibration-based methods that avoid density-ratio estimation altogether. Central to the approach is Bellman calibration via fitted Q-iteration and isotonic regression, yielding estimators that are efficient for oracle dimension-reduced targets and superefficient for the original estimand, albeit with potential irregularity under local alternatives. Theoretical results provide pathwise differentiability, efficient influence functions, and conditions for nuisance-rate and empirical-process control, while numerical experiments demonstrate superior stability and coverage under limited intertemporal overlap. These contributions offer practical, scalable tools for long-term causal inference in dynamic settings where traditional nonparametric DRL faces instability and heavy computation.
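For orientation, the "Bellman equation as a linear inverse problem" can be written out as below. This is the standard policy-evaluation Bellman equation; the symbols $r_0$, $\pi$, $\mathcal{P}^{\pi}_0$, and $H$ are notation assumed here for illustration and may differ from the paper's own.

$$
Q_0(s,a) \;=\; r_0(s,a) \;+\; \gamma\, E_0\Big[\sum_{a'} \pi(a' \mid S')\, Q_0(S',a') \;\Big|\; S = s,\, A = a\Big],
\qquad r_0(s,a) = E_0[R \mid S = s,\, A = a].
$$

Writing $(\mathcal{P}^{\pi}_0 h)(s,a) = E_0\big[\sum_{a'} \pi(a' \mid S')\, h(S',a') \mid S = s,\, A = a\big]$, the same statement reads

$$
(I - \gamma\, \mathcal{P}^{\pi}_0)\, Q_0 \;=\; r_0 ,
$$

so $Q_0$ is the solution of a linear equation in the unknown function $Q$. In the semiparametric setting described above, the solution is sought within a restricted function class $H$ rather than among all functions of $(s,a)$.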

Abstract

Double Reinforcement Learning (DRL) enables efficient inference for policy values in nonparametric Markov decision processes (MDPs), but existing methods face two major obstacles: (1) they require stringent intertemporal overlap conditions on state trajectories, and (2) they rely on estimating high-dimensional occupancy density ratios. Motivated by problems in long-term causal inference, we extend DRL to a semiparametric setting and develop doubly robust, automatic estimators for general linear functionals of the Q-function in infinite-horizon, time-homogeneous MDPs. By imposing structure on the Q-function, we relax the overlap conditions required by nonparametric methods and obtain efficiency gains. The second obstacle, density-ratio estimation, typically requires computationally expensive and unstable min-max optimization. To address both challenges, we introduce superefficient nonparametric estimators whose limiting variance falls below the generalized Cramér-Rao bound. These estimators treat the Q-function as a one-dimensional summary of the state-action process, reducing high-dimensional overlap requirements to a one-dimensional condition. The procedure is simple to implement: estimate and calibrate the Q-function using fitted Q-iteration, then plug the result into the target functional, thereby avoiding density-ratio estimation altogether.
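The "estimate, calibrate, plug in" recipe can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's implementation: the three-state toy MDP, the tabular (cell-mean) regression inside fitted Q-iteration, the hand-written pool-adjacent-violators routine, the deliberately distorted Q-estimate `q_bad`, and all function names. The calibration step repeatedly regresses the Bellman target on the one-dimensional index `q_bad(s, a)` under a monotonicity constraint, and the policy value is then obtained by plugging in the calibrated Q-function; no density ratio is estimated.

```python
import random

def isotonic_fit(xs, ys):
    """Non-decreasing least-squares fit of ys on xs (pool-adjacent-violators)."""
    groups = {}
    for x, y in zip(xs, ys):                 # pool observations sharing the same x
        s, c = groups.get(x, (0.0, 0))
        groups[x] = (s + y, c + 1)
    blocks = []                              # each block: [sum_y, count, right_edge_x]
    for x in sorted(groups):
        s, c = groups[x]
        blocks.append([s, c, x])
        # merge while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s2, c2, x2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] = x2
    edges = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]

    def predict(x):                          # step function over the fitted blocks
        for e, v in zip(edges, values):
            if x <= e:
                return v
        return values[-1]
    return predict

def fitted_q_iteration(data, policy, gamma, n_states, n_actions, n_iter=200):
    """Tabular FQI: regress r + gamma * V_k(s') on (s, a) using cell means."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iter):
        sums = [[0.0] * n_actions for _ in range(n_states)]
        counts = [[0] * n_actions for _ in range(n_states)]
        for s, a, r, s2 in data:
            v_next = sum(p * q[s2][b] for b, p in enumerate(policy[s2]))
            sums[s][a] += r + gamma * v_next
            counts[s][a] += 1
        q = [[sums[s][a] / counts[s][a] if counts[s][a] else q[s][a]
              for a in range(n_actions)] for s in range(n_states)]
    return q

def calibrate(data, q_est, policy, gamma, n_iter=200):
    """Learn a monotone f so that f(q_est(s, a)) approximately solves the Bellman equation."""
    xs = [q_est[s][a] for s, a, _, _ in data]
    f = lambda x: x                          # start from the uncalibrated estimate
    for _ in range(n_iter):
        ys = [r + gamma * sum(p * f(q_est[s2][b]) for b, p in enumerate(policy[s2]))
              for _, _, r, s2 in data]
        f = isotonic_fit(xs, ys)             # isotonic regression of targets on q_est
    return f

def simulate(n, seed=0):
    """Hypothetical toy MDP: 3 states, 2 actions, uniform behaviour policy."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s, a = rng.randrange(3), rng.randrange(2)
        step = 1 if a == 1 else -1
        s2 = min(2, max(0, s + step)) if rng.random() < 0.8 else s
        r = s + rng.uniform(-0.5, 0.5)
        data.append((s, a, r, s2))
    return data

gamma = 0.9
data = simulate(3000)
policy = [[0.0, 1.0]] * 3                    # target policy: always take action 1

# Step 1: estimate Q with fitted Q-iteration, then plug in (uniform initial states).
q_hat = fitted_q_iteration(data, policy, gamma, 3, 2)
psi_plugin = sum(sum(p * q_hat[s][a] for a, p in enumerate(policy[s])) for s in range(3)) / 3

# Step 2: calibrate a deliberately distorted (but order-preserving) Q-estimate.
q_bad = [[0.5 * v + 3.0 for v in row] for row in q_hat]
f = calibrate(data, q_bad, policy, gamma)
psi_cal = sum(sum(p * f(q_bad[s][a]) for a, p in enumerate(policy[s])) for s in range(3)) / 3

print(psi_plugin, psi_cal)
```

In this tabular toy the distortion preserves the ordering of the Q-values, so calibration should bring the plug-in value back in line with the undistorted fitted-Q estimate. With a flexible learner on continuous states, the calibrated and uncalibrated estimates would generally differ.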

Paper Structure (51 sections, 15 theorems, 123 equations, 6 figures, 3 algorithms)

Introduction and motivation
Contributions of this work
Related work
Preliminaries
Data Structure and Markov Decision Model
Inferential objective
The Bellman equation as an inverse problem
Semiparametric double reinforcement learning
Proposed estimator
Model-robust extension and asymptotic theory
Efficiency considerations
Sieve estimation and model approximation error
Estimation of nuisance functions
Superefficient inference via Bellman calibration
Proposed estimator
...and 36 more sections

Key Result

Theorem 1

Suppose cond::bounded holds. Then the parameter $\Psi_H : \mathcal{P} \to \mathbb{R}$ is pathwise differentiable at $P_0$ with efficient influence function $\varphi^*_{0,H}$. Moreover, for any $\overline{P} \in \mathcal{P}$ for which $\varphi^*_{\overline{P},H}$ exists, the parameter admits the expa

Figures (6)

Figure 1: DAG for Markov Decision Process. $Y_1$ and $S_2$ need not be observed in the experiment.
Figure 2: Bias, standard error (SE), and coverage across discount factors $\gamma$ for setting with limited intertemporal overlap. Subfigure (d) compares Bellman calibration and nonparametric methods in low-overlap settings ($\beta = 0.7, 0.8, 0.9$); adaptive DRL (tree) results closely resemble the nonparametric method and are omitted for clarity.
Figure 3: Bias, standard error (SE), and coverage across discount factors $\gamma$ for various values of $\beta$.
Figure (unnumbered): Fitted $Q$-calibration with isotonic regression
Figure (unnumbered): Fitted $Q$-iteration
...and 1 more figure

Theorems & Definitions (32)

Example 1: Policy Value in an MDP
Example 2: Long-term causal effect in an A/B test
Theorem 1: Pathwise differentiability
Theorem 2
Example 3: Eliminating overlap dependence via time-invariant state structure
Theorem 3: EIF with known reward or kernel
Theorem 4: Model approximation error
Example 4: Efficiency bound under dimension reduction
Theorem 5: Parameter approximation error is second-order
Theorem 6: Asymptotic linearity and superefficiency
...and 22 more
