Semiparametric Double Reinforcement Learning with Applications to Long-Term Causal Inference

Lars van der Laan, David Hubbard, Allen Tran, Nathan Kallus, Aurélien Bibaut

TL;DR

The paper develops a semiparametric framework for double reinforcement learning (DRL) to enable valid long-horizon causal inference under weaker overlap assumptions. By restricting the Q-function to a subspace and leveraging the Bellman equation as a linear inverse problem, it derives automatic, doubly robust estimators with improved efficiency, and introduces superefficient, calibration-based methods that avoid density-ratio estimation altogether. Central to the approach is Bellman calibration via fitted Q-iteration and isotonic regression, yielding estimators that are efficient for oracle dimension-reduced targets and superefficient for the original estimand, albeit with potential irregularity under local alternatives. Theoretical results provide pathwise differentiability, efficient influence functions, and conditions for nuisance-rate and empirical-process control, while numerical experiments demonstrate superior stability and coverage under limited intertemporal overlap. These contributions offer practical, scalable tools for long-term causal inference in dynamic settings where traditional nonparametric DRL faces instability and heavy computation.
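For orientation, the phrase "the Bellman equation as a linear inverse problem" can be written out in generic notation. The symbols below ($r_0$, $\mathcal{T}_0$, $\pi$, and the time indices) are assumptions chosen for illustration and may differ from the paper's own notation; the semiparametric version additionally assumes that $Q_0$ lies in a chosen subspace.

```latex
% Bellman equation for the Q-function of a target policy \pi with discount factor \gamma
\[
Q_0(a, s) = r_0(a, s)
  + \gamma \, E_0\!\left[ \int Q_0(a', S_{t+1}) \, \pi(a' \mid S_{t+1}) \, da' \;\middle|\; A_t = a,\ S_t = s \right],
\qquad r_0(a, s) := E_0[\, Y_t \mid A_t = a,\ S_t = s \,].
\]

% The same identity, read as a linear inverse problem in Q_0
\[
\mathcal{T}_0 Q_0 = r_0,
\qquad
(\mathcal{T}_0 Q)(a, s) := Q(a, s)
  - \gamma \, E_0\!\left[ \int Q(a', S_{t+1}) \, \pi(a' \mid S_{t+1}) \, da' \;\middle|\; A_t = a,\ S_t = s \right].
\]

% One linear functional of Q_0: the policy value, averaging over initial states S_0
\[
\Psi(P_0) = E_0\!\left[ \int Q_0(a, S_0) \, \pi(a \mid S_0) \, da \right].
\]
```

Because $\mathcal{T}_0$ is linear in $Q$, any linear functional of $Q_0$ is a functional of the solution to a linear equation, which is the structure the TL;DR refers to.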

Abstract

Double Reinforcement Learning (DRL) enables efficient inference for policy values in nonparametric Markov decision processes (MDPs), but existing methods face two major obstacles: (1) they require stringent intertemporal overlap conditions on state trajectories, and (2) they rely on estimating high-dimensional occupancy density ratios. Motivated by problems in long-term causal inference, we extend DRL to a semiparametric setting and develop doubly robust, automatic estimators for general linear functionals of the Q-function in infinite-horizon, time-homogeneous MDPs. By imposing structure on the Q-function, we relax the overlap conditions required by nonparametric methods and obtain efficiency gains. The second obstacle, density-ratio estimation, typically requires computationally expensive and unstable min-max optimization. To address both challenges, we introduce superefficient nonparametric estimators whose limiting variance falls below the generalized Cramér-Rao bound. These estimators treat the Q-function as a one-dimensional summary of the state-action process, reducing high-dimensional overlap requirements to a one-dimensional condition. The procedure is simple to implement: estimate and calibrate the Q-function using fitted Q-iteration, then plug the result into the target functional, thereby avoiding density-ratio estimation altogether.
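The three-step recipe in the last sentence (fit Q, calibrate it, plug in) can be sketched on a toy problem. The following is a minimal illustration, not the paper's algorithm: the five-state MDP, the always-move-up target policy, the tabular Q-estimate, and the helper names `simulate`, `isotonic_fit`, `fitted_q_iteration`, `bellman_calibrate`, and `plug_in_value` are all assumptions made for this example. Calibration is approximated here by repeatedly regressing the Bellman target on the fitted Q-values with isotonic regression.

```python
import random
from collections import defaultdict

def simulate(n, seed=0):
    """Hypothetical toy MDP: states 0..4, random binary action, reward s/4 plus noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s, a = rng.randrange(5), rng.randrange(2)
        step = 1 if a == 1 else -1
        s2 = min(4, max(0, s + step)) if rng.random() < 0.8 else s
        data.append((s, a, s / 4 + rng.gauss(0, 0.1), s2))
    return data

def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y on x (pool-adjacent-violators)."""
    groups = defaultdict(lambda: [0.0, 0])       # pool tied x values first
    for xi, yi in zip(x, y):
        groups[xi][0] += yi
        groups[xi][1] += 1
    blocks = []                                   # each block: [sum_y, count, largest_x]
    for xi in sorted(groups):
        blocks.append([groups[xi][0], groups[xi][1], xi])
        # merge backwards while adjacent block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c, x_max = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] = x_max
    knots = [(x_max, s / c) for s, c, x_max in blocks]

    def f(q):                                     # step function defined by the blocks
        for x_max, level in knots:
            if q <= x_max:
                return level
        return knots[-1][1]
    return f

def fitted_q_iteration(data, policy, gamma, n_iter=200):
    """Tabular fitted Q-iteration: regress y + gamma * Q(s', policy(s')) on (s, a)."""
    q = defaultdict(float)
    for _ in range(n_iter):
        tot = defaultdict(lambda: [0.0, 0])
        for s, a, y, s2 in data:
            tot[(s, a)][0] += y + gamma * q[(s2, policy(s2))]
            tot[(s, a)][1] += 1
        q = defaultdict(float, {k: v[0] / v[1] for k, v in tot.items()})
    return q

def bellman_calibrate(data, q, policy, gamma, n_iter=50):
    """Iteratively fit a monotone map f so that f(Q) tracks the Bellman target."""
    f = lambda v: v                               # start from the uncalibrated Q
    x = [q[(s, a)] for s, a, _, _ in data]
    for _ in range(n_iter):
        targets = [y + gamma * f(q[(s2, policy(s2))]) for _, _, y, s2 in data]
        f = isotonic_fit(x, targets)
    return f

def plug_in_value(initial_states, q, f, policy):
    """Plug the calibrated Q into the policy-value functional (average over S_0)."""
    return sum(f(q[(s, policy(s))]) for s in initial_states) / len(initial_states)

policy = lambda s: 1                              # assumed target policy: always move up
gamma = 0.9
data = simulate(4000)
q_hat = fitted_q_iteration(data, policy, gamma)
f_hat = bellman_calibrate(data, q_hat, policy, gamma)
value = plug_in_value(range(5), q_hat, f_hat, policy)
print(round(value, 2))
```

In this toy setting the tabular Q-estimate is already well specified, so calibration changes little; the point is only to show that the calibrated plug-in involves one-dimensional regressions on the fitted Q-values and no density-ratio estimate.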

Paper Structure

This paper contains 51 sections, 15 theorems, 123 equations, 6 figures, and 3 algorithms.

Key Result

Theorem 1

Suppose the condition labeled cond::bounded holds. Then the parameter $\Psi_H : \mathcal{P} \to \mathbb{R}$ is pathwise differentiable at $P_0$ with efficient influence function $\varphi^*_{0,H}$. Moreover, for any $\overline{P} \in \mathcal{P}$ for which $\varphi^*_{\overline{P},H}$ exists, the parameter admits the expa

Figures (6)

  • Figure 1: DAG for Markov Decision Process. $Y_1$ and $S_2$ need not be observed in the experiment.
  • Figure 2: Bias, standard error (SE), and coverage across discount factors $\gamma$ for setting with limited intertemporal overlap. Subfigure (d) compares Bellman calibration and nonparametric methods in low-overlap settings ($\beta = 0.7, 0.8, 0.9$); adaptive DRL (tree) results closely resemble the nonparametric method and are omitted for clarity.
  • Figure 3: Bias, standard error (SE), and coverage across discount factors $\gamma$ for various values of $\beta$.
  • Figure: Fitted $Q$-calibration with isotonic regression
  • Figure: Fitted $Q$-iteration
  • ...and 1 more figure

Theorems & Definitions (32)

  • Example 1: Policy Value in an MDP
  • Example 2: Long-term causal effect in an A/B test
  • Theorem 1: Pathwise differentiability
  • Theorem 2
  • Example 3: Eliminating overlap dependence via time-invariant state structure
  • Theorem 3: EIF with known reward or kernel
  • Theorem 4: Model approximation error
  • Example 4: Efficiency bound under dimension reduction
  • Theorem 5: Parameter approximation error is second-order
  • Theorem 6: Asymptotic linearity and superefficiency
  • ...and 22 more