Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning

Vikram Krishnamurthy, Luke Snow

Abstract

Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses; adaptive IRL aims to reconstruct this loss function by passively observing the learner's gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the gradients required by the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Consequently, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients: we reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach that exploits these for counterfactual gradient estimation.
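
To make the counterfactual difficulty concrete, the following illustration reuses the symbols $\ell$, $g$, $X^\theta$ that appear in Theorem 1 below; the precise conditioning object is an assumption made here for illustration, not the paper's definition. The passive observer must estimate conditional expectations of the form

$$
\mathbb{E}\big[\ell(X^\theta) \,\big|\, g(X^\theta) = \alpha\big],
$$

where the query point $\alpha$ is generated by the observer's own Langevin chain rather than by the forward learner. The event $\{g(X^\theta) = \alpha\}$ has probability zero under the learner's trajectory, so a naive Monte Carlo estimator almost surely collects no samples satisfying the condition, while the standard kernel-smoothed surrogate

$$
\frac{\mathbb{E}\big[\ell(X^\theta)\,K_h\big(g(X^\theta)-\alpha\big)\big]}{\mathbb{E}\big[K_h\big(g(X^\theta)-\alpha\big)\big]}
$$

introduces a bandwidth $h$ whose bias-variance trade-off yields slower-than-Monte-Carlo convergence.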

Paper Structure

This paper contains 29 sections, 6 theorems, 57 equations, and 4 figures.

Key Result

Theorem 1

Assume $\ell(X^\theta), g(X^\theta) \in L^2(\Omega)$ and $D_t\ell(X^\theta), D_tg(X^\theta) \in L^2(\Omega \times [0,T])$. Then the following conditional expectation reformulation holds: $\blacktriangleleft$ Here $u$ is any process that satisfies $\blacktriangleleft$.
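
For orientation, a standard reformulation of this type from the Malliavin Monte Carlo literature is sketched below; Theorem 1 plausibly takes a form along these lines, but the exact statement and the normalization condition on $u$ should be read from the paper itself.

$$
\mathbb{E}\big[\ell(X^\theta)\,\big|\,g(X^\theta)=\alpha\big]
\;=\;
\frac{\mathbb{E}\big[\ell(X^\theta)\,\mathbf{1}_{\{g(X^\theta)\ge \alpha\}}\,\delta(u)\big]}
     {\mathbb{E}\big[\mathbf{1}_{\{g(X^\theta)\ge \alpha\}}\,\delta(u)\big]},
\qquad
\int_0^T D_t\, g(X^\theta)\, u_t \, dt \;=\; 1 \ \text{a.s.}
$$

Here $\delta(u)$ denotes the Skorohod integral of $u$ (the adjoint of the Malliavin derivative $D$). Both numerator and denominator are unconditioned expectations, so each can be estimated by plain Monte Carlo at standard rates, which is the gain described in the abstract.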

Figures (4)

  • Figure 1: Estimated gradient $\widehat{ L'}(\alpha)$ versus the true gradient $L'(\alpha)$ over a grid of target locations $\alpha$, using Monte Carlo estimation with sample size $N=5000$.
  • Figure 2: Sequential gradient estimates $\widehat{L'}(\alpha)$ versus the true gradient $L'(\alpha)$ across iterations of the outer passive Langevin chain. Occasional outliers appear, yet the overall procedure recovers the target Gibbs distribution and hidden loss function accurately (see Figures 3 and 4).
  • Figure 3: Empirical histogram of the stationary distribution of the Langevin chain $\{\alpha_k\}$ ($2000$ MCMC steps after a $300$-step burn-in), compared with the target Gibbs density $\pi_\beta(\alpha)\propto e^{-\beta L(\alpha)}$.
  • Figure 4: Adaptive IRL for reconstructing the loss function $L(\cdot)$ from MCMC samples of $\exp(-\beta L(\cdot))$. Specifically, $L(\cdot) = -\frac{1}{\beta}\log(\hat{\pi}(\cdot))$, where $\hat{\pi}(\cdot)$ is the empirical MCMC distribution in Figure 3 (a minimal reconstruction sketch follows this list).
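
The reconstruction step described in Figure 4 can be summarized by a short numerical sketch. This is a minimal illustration, not the paper's code: the function name, binning choice, and the synthetic Gaussian stand-in for the outer Langevin chain are assumptions made here.

```python
import numpy as np

def reconstruct_loss(alpha_samples, beta, n_bins=50):
    """Recover the hidden loss L(.) from MCMC samples of exp(-beta*L(.)),
    using L(a) = -(1/beta) * log(pi_hat(a)) as in the Figure 4 caption.
    L is only identified up to an additive constant, so shift its minimum to zero."""
    # Empirical density pi_hat via a normalized histogram of the chain.
    counts, edges = np.histogram(alpha_samples, bins=n_bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = counts > 0                      # skip empty bins to avoid log(0)
    L_hat = -np.log(counts[mask]) / beta
    L_hat -= L_hat.min()                   # fix the additive constant
    return centers[mask], L_hat

if __name__ == "__main__":
    # Hypothetical stand-in for the outer Langevin chain: with L(a) = a^2/2,
    # the Gibbs density exp(-beta*L(a)) is Gaussian with variance 1/beta.
    rng = np.random.default_rng(0)
    beta = 2.0
    samples = rng.normal(0.0, np.sqrt(1.0 / beta), size=2000)
    grid, L_hat = reconstruct_loss(samples, beta)
    print(grid[:5], L_hat[:5])
```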

Theorems & Definitions (10)

  • Theorem 1
  • Lemma 2: Malliavin derivative of the Langevin Diffusion
  • proof
  • Lemma 3: Choice of Skorohod Integrand
  • proof
  • Lemma 4
  • proof
  • Theorem 5: Asymptotic consistency of Algorithm 1
  • proof
  • Lemma 6