FP-IRL: Fokker-Planck Inverse Reinforcement Learning -- A Physics-Constrained Approach to Markov Decision Processes
Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati
TL;DR
FP-IRL addresses inverse reinforcement learning (IRL) when the transition dynamics are unknown, using a physics-constrained framework that models Markov decision processes as Fokker-Planck (FP) transport. It jointly infers the FP potential $\psi$ and inverse temperature $\beta$ from trajectory-derived densities via variational system identification, then recovers the transition $T$, reward $R$, and policy $\pi$ through closed-form relations such as the inverse Bellman equation and the Boltzmann policy. The resulting representations are interpretable and physics-grounded, and the method demonstrates accurate recovery on synthetic Grid World and Mountain Car benchmarks, with convergence observed under mesh refinement. The framework targets continuous stochastic systems for which no explicit transition model is available, with potential applications in biology, physics, and decision-making.
Abstract
Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior: it infers an unknown reward function from observed trajectories within a Markov decision process (MDP). However, most existing IRL methods require access to the transition function, either prescribed or estimated \textit{a priori}, which poses significant challenges when the underlying dynamics are unknown, unobservable, or not easily sampled. We propose Fokker--Planck inverse reinforcement learning (FP-IRL), a novel physics-constrained IRL framework tailored for systems governed by Fokker--Planck (FP) dynamics. FP-IRL simultaneously infers both the reward and transition functions directly from trajectory data, without requiring access to sampled transitions. Our method leverages a conjectured equivalence between MDPs and the FP equation, linking reward maximization in MDPs with free energy minimization in FP dynamics. This connection enables inference of the potential function by variational system identification, from which the full set of MDP components -- reward, transition, and policy -- can be recovered using analytic expressions. We demonstrate the effectiveness of FP-IRL through experiments on synthetic benchmarks and a modified version of the Mountain Car problem. Our results show that FP-IRL achieves accurate recovery of agent incentives while preserving computational efficiency and physical interpretability.
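
To make the analytic recovery step concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes that the state-action value is the negative of the inferred potential, $Q = -\psi$, that the policy is the Boltzmann distribution $\pi(a \mid s) \propto \exp(\beta Q(s,a))$, and that the inverse Bellman relation takes the common form $R(s,a) = Q(s,a) - \gamma\, \mathbb{E}_{s' \sim T,\, a' \sim \pi}[Q(s',a')]$. The function names, the discount factor `gamma`, and the random stand-ins for $\psi$ and $T$ are hypothetical and chosen only for illustration; in FP-IRL these quantities would come from the inferred FP model.

```python
import numpy as np


def boltzmann_policy(q, beta):
    """Return pi(a|s) proportional to exp(beta * Q(s, a)); q has shape (S, A)."""
    z = beta * q
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def inverse_bellman_reward(q, policy, transition, gamma):
    """Return R(s,a) = Q(s,a) - gamma * E_{s'~T, a'~pi}[Q(s',a')].

    transition has shape (S, A, S) with transition[s, a, s'] = T(s'|s, a).
    """
    v = (policy * q).sum(axis=1)       # V(s') = sum_a' pi(a'|s') Q(s',a')
    return q - gamma * transition @ v  # (S, A, S) @ (S,) -> (S, A)


# Toy problem with random stand-ins (ASSUMPTION: illustrative values only).
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
psi = rng.normal(size=(n_states, n_actions))  # stand-in for an inferred potential
q = -psi                                      # ASSUMPTION: Q taken as -psi
beta, gamma = 2.0, 0.9                        # hypothetical parameter values

transition = rng.random((n_states, n_actions, n_states))
transition /= transition.sum(axis=-1, keepdims=True)  # normalize over s'

policy = boltzmann_policy(q, beta)
reward = inverse_bellman_reward(q, policy, transition, gamma)
```

By construction, substituting `reward` back into the Bellman equation under `policy` and `transition` reproduces `q`, which is a quick consistency check on any recovered reward.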
