Finite-Sample Bounds for Adaptive Inverse Reinforcement Learning using Passive Langevin Dynamics

Luke Snow; Vikram Krishnamurthy

TL;DR

The paper develops a rigorous finite-sample analysis of passive stochastic gradient Langevin dynamics (PSGLD) for adaptive inverse reinforcement learning (IRL). By passively observing a forward learner’s gradients and running a kernel-weighted PSGLD algorithm, the inverse learner samples from the Gibbs distribution $\pi_{\infty}(\alpha) \propto \exp(-\beta J(\alpha))$, enabling nonparametric reconstruction of the forward learner's cost function $J$ in real time. The main contributions are explicit non-asymptotic bounds on the 2-Wasserstein distance between the PSGLD law and the Gibbs target, and a kernel density estimation scheme with an $L^1$ reconstruction bound for $J$, both proved via diffusion theory, Girsanov’s theorem, and logarithmic Sobolev inequalities. The results extend prior finite-sample analyses of SGLD to the passive setting and give practical guidance on choosing parameters to achieve arbitrary accuracy in finite time. This work offers a principled, finite-time foundation for adaptive IRL across broad learning settings, including RL, Bayesian learning, and empirical risk minimization, with concrete implications for kernel-based cost reconstruction from transient demonstrations.

Abstract

This paper provides a finite-sample analysis of a passive stochastic gradient Langevin dynamics (PSGLD) algorithm designed to achieve adaptive inverse reinforcement learning (IRL). Adaptive IRL aims to estimate the cost function of a forward learner performing a stochastic gradient algorithm (e.g., policy gradient reinforcement learning) by observing its estimates in real time. The PSGLD algorithm is considered passive because it incorporates noisy gradients provided by an external stochastic gradient algorithm (the forward learner), over which it has no control. The PSGLD algorithm acts as a randomized sampler to achieve adaptive IRL by reconstructing the forward learner's cost function nonparametrically from the stationary measure of a Langevin diffusion. This paper analyzes the non-asymptotic (finite-sample) performance: we provide explicit bounds on the 2-Wasserstein distance between the sample measure of the PSGLD algorithm and the stationary measure encoding the cost function, and provide guarantees for a kernel density estimation scheme which reconstructs the cost function from empirical samples. Our analysis uses tools from the study of Markov diffusion operators. The derived bounds have both practical and theoretical significance. They provide finite-time guarantees for an adaptive IRL mechanism, and substantially generalize the analytical framework of a line of research in passive stochastic gradient algorithms.
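The reconstruction idea in the abstract (sample from a measure proportional to $\exp(-\beta J)$, estimate its density, take the logarithm) can be illustrated with a minimal sketch. This is not the paper's PSGLD algorithm: it swaps in a standard unadjusted Langevin sampler with direct access to $\nabla J$, and the function names, the quadratic test cost, the Gaussian kernel, and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np


def langevin_samples(grad_J, beta, step, n_steps, n_chains, rng):
    """Run independent unadjusted Langevin chains and return their final iterates.

    Discretizes d(alpha) = -grad J(alpha) dt + sqrt(2/beta) dW, whose
    stationary density is proportional to exp(-beta * J(alpha)).
    NOTE: this is plain Langevin sampling, not the passive (PSGLD) variant.
    """
    alpha = rng.standard_normal(n_chains)  # one scalar state per chain
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        alpha = alpha - step * grad_J(alpha) + np.sqrt(2.0 * step / beta) * noise
    return alpha


def gaussian_kde(samples, grid, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def reconstruct_cost(samples, grid, beta, bandwidth):
    """Recover J up to an additive constant via J = -(1/beta) * log(density)."""
    cost = -np.log(gaussian_kde(samples, grid, bandwidth)) / beta
    return cost - cost.min()  # fix the unknown constant by shifting the minimum to 0


# Hypothetical test problem: J(alpha) = alpha^2 / 2, so grad J(alpha) = alpha.
rng = np.random.default_rng(0)
beta = 2.0
samples = langevin_samples(lambda a: a, beta, step=0.01,
                           n_steps=500, n_chains=20000, rng=rng)
grid = np.linspace(-1.0, 1.0, 41)
J_hat = reconstruct_cost(samples, grid, beta, bandwidth=0.1)
J_true = 0.5 * grid**2
print("max reconstruction error on grid:", np.max(np.abs(J_hat - J_true)))
```

The cost is identifiable only up to an additive constant, because the normalizing constant of the Gibbs measure is unknown; the sketch removes this ambiguity by shifting the minimum to zero. The step size and kernel bandwidth each introduce a small bias, which is the kind of effect the finite-sample bounds are meant to control.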

Paper Structure (51 sections, 27 theorems, 219 equations, 4 figures, 3 algorithms)

Introduction
Context
Stochastic Gradient Langevin Dynamics
Why Passive Stochastic Gradient Langevin Dynamics?
Non-Asymptotic Analysis. Extension of raginsky2017non
Generalized Inverse Learning
Main Result and Proof Technique
Organization
Modeling the Forward Learner's Data Generation Process
Assumptions on Forward Learner's Data Generation Process
Stochastic Gradient Descent Process satisfying Data-Generation Assumptions
Examples of Randomly Re-Initialization Stochastic Gradient Algorithms
Passive Langevin Dynamics for Adaptive Inverse Reinforcement Learning
Stochastic Gradient Langevin Dynamics
Inverse Learning through Passive Stochastic Gradient Langevin Dynamics
...and 36 more sections

Key Result

Proposition 1

Let $\alpha^{\epsilon}(t) = \alpha_k$ for $t \in [\epsilon k, \epsilon (k+1)]$ be the continuous-time interpolation of PSGLD eq:dt_sgld. Under assumptions (A1)-(A4) of krishnamurthy2021langevin, the process $\alpha^{\epsilon}(t)$ converges weakly to the solution of the stochastic differential equation, where $W(t)$ is standard $N$-dimensional Brownian motion. Furthermore, the stochastic differential

Figures (4)

Figure 1: Schematic for adaptive inverse reinforcement learning. A forward learner evaluates sequential stochastic gradients $\{\hat{\nabla}J(\theta_k)\}_{k\in\mathbb{N}}$ of a cost function $J$, through, e.g., stochastic gradient descent (SGD), to obtain the minima of $J$. An inverse learner observes $\{\hat{\nabla}J(\theta_k)\}_{k\in\mathbb{N}}$ and attempts to reconstruct the cost $J$ through the passive stochastic gradient Langevin dynamics (PSGLD) algorithm. The aim of this paper is to provide a finite-sample analysis of the PSGLD algorithm \ref{eq:dt_psgld}.
Figure 2: High level procedure for achieving inverse reinforcement learning. The forward learning process is represented by a stochastic gradient descent (SGD), and the inverse learner incorporates sequential SGD evaluations $\theta_k$ into its PSGLD algorithm to reconstruct $J$. The PSGLD algorithm reconstructs $J$ by approximately sampling from the Gibbs measure $\pi_{\infty}$ (then taking the log-sample density). We measure the proximity of the PSGLD algorithm to $\pi_{\infty}$ by $\mathcal{W}_2(\pi_{k},\pi_{\infty})$, the 2-Wasserstein distance between the sample law of $\alpha_k$ and the measure $\pi_{\infty}$. We control this distance by bounding it by $\mathcal{W}_2(\pi_{k},\nu_{k\epsilon}) + \mathcal{W}_2(\nu_{k\epsilon},\pi_{\infty})$, where $\nu_{k\epsilon}$ is the law of $\alpha(t)$ at time $t=k\epsilon$.
Figure 3: Schematic illustrating the operation of Algorithm \ref{alg:costrec}. Recall that Algorithm \ref{alg:psgld} provides an MCMC technique for generating data points from the sample measure $\pi_k$, and Theorem \ref{thm:main1} provides guarantees on the 2-Wasserstein distance $\mathcal{W}_2(\pi_{k},\pi_{\infty})$ between $\pi_{k}$ and the stationary Gibbs measure $\pi_{\infty}$. Algorithm \ref{alg:costrec} acts as a pre- and post-processing procedure for reconstructing the cost function $J$ using samples from Algorithm \ref{alg:psgld}. It pre-processes by initializing streams of Algorithm \ref{alg:psgld} with appropriately chosen parameters. It post-processes by acquiring MCMC samples $\{\alpha_{\hat{k}}^i\}_{i\in[1,T]}$ at a specified iterate $\hat{k}$, reconstructing the sample measure through kernel density estimation, and recovering the cost function by logarithmically transforming this estimated measure.
Figure 4: Theorem \ref{thm:main1} proof structure. First, the 2-Wasserstein distance between the discrete-time algorithm \ref{eq:dt_sgld} (with measure $\pi_{k}$) and the continuous-time diffusion \ref{eq:ct_diff} (with measure $\nu_{k\epsilon}$) is bounded. We must introduce an intermediate process (with law $\gamma_{k\epsilon}$). Lemma \ref{lem:MSEbd} bounds the Wasserstein distance between $\pi_{k}$ and $\gamma_{k\epsilon}$. Lemma \ref{lem:KL_bd} bounds the KL-divergence between $\pi_{k}$ and $\gamma_{k\epsilon}$. Corollary \ref{cor:Villani} is then used, along with Lemma \ref{lem:exp_int}, to relate this KL bound to a 2-Wasserstein bound. Proposition \ref{prop:logsob} is the key tool in bounding $\mathcal{W}_2(\nu_{k\epsilon},\pi_{\infty})$, establishing that $\pi_{\infty}$ satisfies a log-Sobolev inequality. We then employ exponential decay of entropy (Lemma \ref{lem:expdecay}) and the Otto-Villani Theorem (Lemma \ref{lem:OVthm}) to obtain exponential decay of $\mathcal{W}_2(\nu_{k\epsilon},\pi_{\infty})$.
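The decomposition in Figure 2 and the last two steps of the proof structure in Figure 4 can be written out as a short chain of inequalities. The display below is a sketch under assumptions, not the paper's statement: $C$ denotes a log-Sobolev constant for $\pi_{\infty}$ in the normalization of bakry2014analysis, $D(\cdot\,\|\,\cdot)$ denotes relative entropy, and time is scaled so that the entropy decay rate is $2/C$. The constants and rates in the paper's own theorems depend on its scaling of the diffusion and may differ.

```latex
\begin{align*}
\mathcal{W}_2(\pi_k,\pi_\infty)
  &\le \mathcal{W}_2(\pi_k,\nu_{k\epsilon}) + \mathcal{W}_2(\nu_{k\epsilon},\pi_\infty)
  && \text{(triangle inequality)} \\
D(\nu_t \,\|\, \pi_\infty)
  &\le e^{-2t/C}\, D(\nu_0 \,\|\, \pi_\infty)
  && \text{(exponential decay of entropy)} \\
\mathcal{W}_2(\nu_t,\pi_\infty)^2
  &\le 2C\, D(\nu_t \,\|\, \pi_\infty)
  && \text{(Otto-Villani)} \\
\Longrightarrow\quad
\mathcal{W}_2(\nu_{k\epsilon},\pi_\infty)
  &\le \sqrt{2C\, D(\nu_0 \,\|\, \pi_\infty)}\; e^{-k\epsilon/C}
  && \text{(combine at } t = k\epsilon\text{)}
\end{align*}
```

The first term of the triangle inequality, $\mathcal{W}_2(\pi_k,\nu_{k\epsilon})$, is the discretization error handled through the intermediate process $\gamma_{k\epsilon}$ described in Figure 4. The second term decays exponentially in the elapsed time $k\epsilon$.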

Theorems & Definitions (46)

Proposition 1: Weak Convergence krishnamurthy2021langevin
Theorem 1: Finite-Sample 2-Wasserstein Bound
Theorem 2
proof
Lemma 2: Exponential decay of entropy bakry2014analysis, Th. 5.2.1
Lemma 3: Otto-Villani theorem bakry2014analysis, Th. 9.6.1
Proposition 2: Bakry 2008 bakry2008simple
Proposition 3: Cattiaux et al. (2010) cattiaux2008note
Corollary 2: Bolley and Villani 2005 bolley2005weighted Cor. 2.3
Lemma 4
...and 36 more
