FP-IRL: Fokker-Planck Inverse Reinforcement Learning -- A Physics-Constrained Approach to Markov Decision Processes

Chengyang Huang; Siddhartha Srivastava; Kenneth K. Y. Ho; Kathy E. Luker; Gary D. Luker; Xun Huan; Krishna Garikipati

TL;DR

FP-IRL addresses inverse reinforcement learning when the transition dynamics are unknown, using a physics-constrained framework that models Markov decision processes through Fokker-Planck transport. It jointly learns the FP potential $\psi$ and inverse temperature $\beta$ from trajectory-derived densities via variational system identification, then recovers the transition $T$, reward $R$, and policy $\pi$ through closed-form relations such as the inverse Bellman equation and the Boltzmann policy. The resulting representations are interpretable and physics-grounded, and the method demonstrates accurate recovery on synthetic Grid World and Mountain Car benchmarks, with convergence observed under mesh refinement. The framework points toward physics-informed IRL for continuous stochastic systems where explicit transition models are unavailable, with potential applications in biology, physics, and decision-making.
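As a rough illustration of the Boltzmann policy mentioned above, the following is a minimal tabular sketch. It assumes the common form $\pi(a \mid s) \propto \exp(\beta\, Q(s,a))$; the function name `boltzmann_policy` and the toy Q-table are hypothetical and not taken from the paper, whose exact parameterization may differ.

```python
import numpy as np


def boltzmann_policy(q_values, beta):
    """Return pi(a|s) proportional to exp(beta * Q(s, a)).

    q_values: array of shape (num_states, num_actions) (assumed tabular Q).
    beta: inverse temperature; larger values concentrate mass on the best action.
    """
    logits = beta * np.asarray(q_values, dtype=float)
    # Subtract the row-wise maximum for numerical stability; this does not
    # change the normalized probabilities.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Toy example: two states, two actions (illustrative values only).
q = np.array([[1.0, 0.0],
              [0.0, 0.0]])
pi = boltzmann_policy(q, beta=2.0)
```

With equal Q-values the policy is uniform, and as `beta` grows the policy approaches a greedy choice of the highest-valued action.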

Abstract

Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision process (MDP). However, most existing IRL methods require access to the transition function, either prescribed or estimated a priori, which poses significant challenges when the underlying dynamics are unknown, unobservable, or not easily sampled. We propose Fokker--Planck inverse reinforcement learning (FP-IRL), a novel physics-constrained IRL framework tailored for systems governed by Fokker--Planck (FP) dynamics. FP-IRL simultaneously infers both the reward and transition functions directly from trajectory data, without requiring access to sampled transitions. Our method leverages a conjectured equivalence between MDPs and the FP equation, linking reward maximization in MDPs with free energy minimization in FP dynamics. This connection enables inference of the potential function using our inference approach of variational system identification, from which the full set of MDP components -- reward, transition, and policy -- can be recovered using analytic expressions. We demonstrate the effectiveness of FP-IRL through experiments on synthetic benchmarks and a modified version of the Mountain Car problem. Our results show that FP-IRL achieves accurate recovery of agent incentives while preserving computational efficiency and physical interpretability.
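For orientation, the link between FP dynamics and free energy minimization can be written in a standard textbook form. The block below is a sketch under that assumption: it uses the potential $\psi$ and inverse temperature $\beta$ named in the TL;DR, with $p$ the density over the lumped variable $\boldsymbol{x}$, and the paper's own sign and scaling conventions may differ.

```latex
% Assumed standard form (as in the Jordan-Kinderlehrer-Otto setting);
% not copied from the paper.
\begin{align}
  \frac{\partial p}{\partial t}
    &= \nabla \cdot \left( p \, \nabla \psi \right) + \beta^{-1} \Delta p, \\
  F[p]
    &= \int \psi \, p \, \mathrm{d}\boldsymbol{x}
     + \beta^{-1} \int p \log p \, \mathrm{d}\boldsymbol{x}, \\
  p_{\infty}(\boldsymbol{x})
    &= \frac{\exp\!\left(-\beta \psi(\boldsymbol{x})\right)}
            {\int \exp\!\left(-\beta \psi(\boldsymbol{y})\right) \mathrm{d}\boldsymbol{y}}.
\end{align}
```

In this form the free energy $F[p]$ is non-increasing along solutions of the FP equation, and the Gibbs density $p_{\infty}$ is its minimizer, which is the sense in which equilibrium corresponds to free energy minimization.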

Paper Structure (39 sections, 3 theorems, 58 equations, 14 figures, 1 table, 2 algorithms)

This paper contains 39 sections, 3 theorems, 58 equations, 14 figures, 1 table, 2 algorithms.

Introduction
Problem Formulation
Preliminaries
Problem statement: IRL with physics-constrained transition inference
Fokker--Planck Inverse Reinforcement Learning
Fokker--Planck physics for learning the transition function
Free energy and its connection to the Q-function in physics-based MDPs
Free energy in statistical mechanics
Free energy in physics-based MDPs
Empirical demonstration of free energy minimization in an MDP
Optimal policy constrained by FP dynamics
Equilibrium case: free energy minimization and the Boltzmann policy
Transient case: variational policy optimization and movement limitation via Wasserstein regularization
Inverse Bellman equation
Summary of the FP-IRL algorithm
...and 24 more sections

Key Result

Theorem 3.2

Let $\mathcal{T}^{\pi}: \mathcal{Q} \rightarrow \mathcal{R}$ be the inverse Bellman operator, where $\mathcal{Q}$ and $\mathcal{R}$ are the spaces of value functions and reward functions, respectively. For a given transition $T$ (eq:mdp_transition) and policy $\pi$ (eq:boltzmann_policy), $\mathcal{T}^{\pi}$ is a bijective mapping.
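For intuition on why such an operator can be bijective, here is a minimal tabular sketch. It assumes the form common in the IRL literature, $(\mathcal{T}^{\pi}Q)(s,a) = Q(s,a) - \gamma\,\mathbb{E}_{s' \sim T(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}[Q(s',a')]$ with discount $\gamma < 1$; the paper's exact definition may differ (for example, by an entropy term). All function names and the random toy MDP are hypothetical.

```python
import numpy as np


def state_action_kernel(T, pi):
    """Build P[(s,a),(s',a')] = T[s,a,s'] * pi[s',a'] as an (S*A, S*A) matrix."""
    S, A, _ = T.shape
    return np.einsum("sat,tb->satb", T, pi).reshape(S * A, S * A)


def inverse_bellman(Q, T, pi, gamma):
    """Map a Q-table to a reward table: r = Q - gamma * P^pi Q (assumed form)."""
    P = state_action_kernel(T, pi)
    q = Q.reshape(-1)
    return (q - gamma * P @ q).reshape(Q.shape)


def bellman_solve(r, T, pi, gamma):
    """Invert the map: Q = (I - gamma * P^pi)^{-1} r."""
    P = state_action_kernel(T, pi)
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r.reshape(-1)).reshape(r.shape)


# Random toy MDP (illustrative only): 4 states, 3 actions.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)    # each T[s, a, :] is a distribution
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)  # each pi[s, :] is a distribution

Q = rng.normal(size=(S, A))
r = inverse_bellman(Q, T, pi, gamma)
Q_back = bellman_solve(r, T, pi, gamma)  # recovers Q up to round-off
```

Because $P^{\pi}$ is row-stochastic and $\gamma < 1$, the matrix $I - \gamma P^{\pi}$ is invertible, so in this tabular setting each reward table corresponds to exactly one Q-table and vice versa.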

Figures (14)

Figure 1: Schematic illustration of an agent's iterative interaction with the environment, modeled as an MDP.
Figure 2: Comparison of the objectives of RL, IRL, and FP-IRL. (a) RL learns an optimal policy given known reward and transition functions in an MDP. Using the learned policy, one can generate trajectories by interacting with the environment. The dashed arrow represents the indirect output (trajectories) of the algorithm. (b) IRL infers the reward function and corresponding policy from observed expert trajectories, assuming access to known transition dynamics. (c) FP-IRL extends IRL by simultaneously inferring both the reward and transition functions, with the latter constrained by physical principles. In all subfigures, black and red parallelograms denote inputs and outputs, respectively, while blue rectangles represent algorithmic components.
Figure 3: Schematic overview of the FP-IRL framework, which infers both reward and transition functions by leveraging the evolution of state-action densities under FP dynamics.
Figure 4: Illustration of how an MDP under a fixed policy induces a Markov process (MP) over the lumped state-action variable $\boldsymbol{x}=(\boldsymbol{s},\boldsymbol{a})$, with transitions governed by the joint dynamics.
Figure 5: Empirical validation of the free energy principle in an MDP setting. In the grid world environment, the agent's state-action distribution evolves toward an equilibrium that minimizes the free energy, consistent with Conjecture 3.1.
...and 9 more figures

Theorems & Definitions (8)

Conjecture 3.1: Value-Potential Equivalence
Theorem 3.2
proof: Sketch of proof
proof
Lemma B.1
proof
Theorem B.1
proof
