FP-IRL: Fokker-Planck Inverse Reinforcement Learning -- A Physics-Constrained Approach to Markov Decision Processes

Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati

TL;DR

FP-IRL tackles the IRL challenge when transition dynamics are unknown by proposing a physics-constrained framework that treats Markov decision processes through Fokker-Planck transport. It jointly learns the FP potential $\psi$ and inverse temperature $\beta$ from trajectory-derived densities via variational system identification, then recovers the transition $T$, reward $R$, and policy $\pi$ using closed-form relations like the inverse Bellman equation and Boltzmann policy. The approach yields interpretable, physics-grounded representations and demonstrates accurate recovery on synthetic Grid World and Mountain Car benchmarks, with convergence observed under mesh refinement. This framework opens pathways for robust, physics-informed IRL in continuous stochastic systems where explicit transition models are unavailable, with potential impact in biology, physics, and decision-making applications.
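The Boltzmann policy mentioned above can be illustrated with a minimal sketch. It assumes the common tabular form $\pi(a \mid s) \propto \exp(\beta\, Q(s,a))$, with $\beta$ the inverse temperature; the paper's exact expression may differ, and the function name `boltzmann_policy` and the Q-values are hypothetical.

```python
import math

def boltzmann_policy(q_values, beta):
    """Return pi(a|s) proportional to exp(beta * Q(s, a)) for one state.

    Assumed form, not taken from the paper; beta > 0 is the inverse temperature.
    """
    # Subtract the max before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical Q-values for three actions in a single state.
q = [1.0, 2.0, 0.5]
pi_soft = boltzmann_policy(q, beta=0.1)    # low beta: close to uniform
pi_sharp = boltzmann_policy(q, beta=10.0)  # high beta: close to greedy
```

Under this assumed form, a small $\beta$ gives a nearly uniform policy, while a large $\beta$ concentrates probability on the highest-valued action.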

Abstract

Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior by inferring an unknown reward function from observed trajectories within a Markov decision process (MDP). However, most existing IRL methods require access to the transition function, either prescribed or estimated a priori, which poses significant challenges when the underlying dynamics are unknown, unobservable, or not easily sampled. We propose Fokker-Planck inverse reinforcement learning (FP-IRL), a novel physics-constrained IRL framework tailored for systems governed by Fokker-Planck (FP) dynamics. FP-IRL simultaneously infers both the reward and transition functions directly from trajectory data, without requiring access to sampled transitions. Our method leverages a conjectured equivalence between MDPs and the FP equation, linking reward maximization in MDPs with free energy minimization in FP dynamics. This connection enables inference of the potential function via variational system identification, from which the full set of MDP components (reward, transition, and policy) can be recovered using analytic expressions. We demonstrate the effectiveness of FP-IRL through experiments on synthetic benchmarks and a modified version of the Mountain Car problem. Our results show that FP-IRL achieves accurate recovery of agent incentives while preserving computational efficiency and physical interpretability.
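As a rough illustration of free energy minimization in FP dynamics, the sketch below uses the standard discrete free energy $F[\rho] = \sum_i \rho_i \psi_i + \beta^{-1} \sum_i \rho_i \log \rho_i$ and checks that the Gibbs distribution $\rho_i \propto \exp(-\beta \psi_i)$ attains a lower value than a uniform density. This functional is the textbook form associated with the FP equation and is an assumption here; the potential values and function names are hypothetical and not taken from the paper.

```python
import math

def free_energy(rho, psi, beta):
    """Discrete F[rho] = sum_i rho_i * psi_i + (1/beta) * sum_i rho_i * log(rho_i)."""
    energy = sum(r * p for r, p in zip(rho, psi))
    # Cells with zero mass contribute nothing to the entropy term.
    neg_entropy = sum(r * math.log(r) for r in rho if r > 0.0)
    return energy + neg_entropy / beta

def gibbs(psi, beta):
    """Gibbs distribution rho_i proportional to exp(-beta * psi_i)."""
    psi_min = min(psi)  # shift for numerical stability
    w = [math.exp(-beta * (p - psi_min)) for p in psi]
    z = sum(w)
    return [x / z for x in w]

psi = [0.0, 1.0, 0.3, 2.0]  # hypothetical potential on four cells
beta = 2.0                  # hypothetical inverse temperature
rho_eq = gibbs(psi, beta)
rho_uniform = [0.25] * 4

f_eq = free_energy(rho_eq, psi, beta)
f_uniform = free_energy(rho_uniform, psi, beta)
```

For this functional, the value at the Gibbs distribution equals $-\beta^{-1} \log \sum_i \exp(-\beta \psi_i)$, which gives a quick consistency check on `f_eq`.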

Paper Structure

This paper contains 39 sections, 3 theorems, 58 equations, 14 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.2

Let $\mathcal{T}^{\pi}: \mathcal{Q} \rightarrow \mathcal{R}$ be the inverse Bellman operator, where $\mathcal{Q}$ and $\mathcal{R}$ are the spaces of value functions and reward functions, respectively. For a given transition $T$ (eq:mdp_transition) and policy $\pi$ (eq:boltzmann_policy), $\mathcal{T}^{\pi}$ is a bijective mapping.
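The defining equation of the operator is not reproduced above. A form commonly used in the soft Q-learning literature, assumed here purely for illustration, is $(\mathcal{T}^{\pi} Q)(s,a) = Q(s,a) - \gamma \sum_{s'} T(s' \mid s,a)\, V(s')$ with $V(s) = \beta^{-1} \log \sum_a \exp(\beta\, Q(s,a))$. The sketch below applies this assumed operator to a small tabular MDP and then recovers $Q$ from the resulting reward by soft value iteration, a numerical round trip consistent with a one-to-one mapping. The MDP, discount factor, and all function names are hypothetical.

```python
import math

def soft_value(q_row, beta):
    """V(s) = (1/beta) * log sum_a exp(beta * Q(s, a)), computed stably."""
    m = max(q_row)
    return m + math.log(sum(math.exp(beta * (q - m)) for q in q_row)) / beta

def inverse_bellman(Q, T, gamma, beta):
    """Assumed operator: R(s, a) = Q(s, a) - gamma * sum_s' T[s][a][s'] * V(s')."""
    V = [soft_value(row, beta) for row in Q]
    return [[Q[s][a] - gamma * sum(t * v for t, v in zip(T[s][a], V))
             for a in range(len(Q[s]))] for s in range(len(Q))]

def soft_value_iteration(R, T, gamma, beta, iters=500):
    """Forward map: iterate Q <- R + gamma * E[V] toward its fixed point."""
    Q = [[0.0] * len(row) for row in R]
    for _ in range(iters):
        V = [soft_value(row, beta) for row in Q]
        Q = [[R[s][a] + gamma * sum(t * v for t, v in zip(T[s][a], V))
              for a in range(len(R[s]))] for s in range(len(R))]
    return Q

# Hypothetical 2-state, 2-action MDP; T[s][a] is a distribution over next states.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
Q = [[1.0, 0.0], [0.5, 2.0]]
gamma, beta = 0.9, 1.0

R = inverse_bellman(Q, T, gamma, beta)            # value function -> reward
Q_back = soft_value_iteration(R, T, gamma, beta)  # reward -> value function
```

Because the soft Bellman update is a contraction for $\gamma < 1$, `Q_back` should agree with `Q` up to numerical tolerance. This is a check on a single example, not a proof of the theorem.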

Figures (14)

  • Figure 1: Schematic illustration of an agent's iterative interaction with the environment, modeled as an MDP.
  • Figure 2: Comparison of the objectives of RL, IRL, and FP-IRL. (a) RL learns an optimal policy given known reward and transition functions in an MDP. Using the learned policy, one can generate trajectories by interacting with the environment. The dashed arrow represents the indirect output (trajectories) of the algorithm. (b) IRL infers the reward function and corresponding policy from observed expert trajectories, assuming access to known transition dynamics. (c) FP-IRL extends IRL by simultaneously inferring both the reward and transition functions, with the latter constrained by physical principles. In all subfigures, black and red parallelograms denote inputs and outputs, respectively, while blue rectangles represent algorithmic components.
  • Figure 3: Schematic overview of the FP-IRL framework, which infers both reward and transition functions by leveraging the evolution of state-action densities under FP dynamics.
  • Figure 4: Illustration of how an MDP under a fixed policy induces a Markov process (MP) over the lumped state-action variable $\boldsymbol{x}=(\boldsymbol{s},\boldsymbol{a})$, with transitions governed by the joint dynamics.
  • Figure 5: Empirical validation of the free energy principle in an MDP setting. In the grid world environment, the agent's state-action distribution evolves toward an equilibrium that minimizes the free energy, consistent with Conjecture 3.1 (Value-Potential Equivalence).
  • ...and 9 more figures

Theorems & Definitions (8)

  • Conjecture 3.1: Value-Potential Equivalence
  • Theorem 3.2
  • proof: Sketch of proof
  • proof
  • Lemma B.1
  • proof
  • Theorem B.1
  • proof