FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning

Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang

TL;DR

This work addresses the weaknesses of Flow Matching (FM) policies in online reinforcement learning by introducing Flow Matching Inverse Reinforcement Learning (FM-IRL), a teacher–student framework in which a full FM model guides a lightweight MLP policy that interacts with the environment. The FM teacher provides an FM-based reward model and a distributional regularizer, enabling stable online learning while preserving the expressive knowledge of expert data. Empirical results across six navigation, locomotion, and manipulation tasks show FM-IRL improves learning efficiency, generalization, and robustness, particularly when expert data are suboptimal, and it significantly reduces inference cost compared to FM-based online policy gradients. The approach offers a practical path to deploy FM-informed agents in real-world settings and suggests opportunities to extend FM-IRL to broader demonstration-based learning problems.

Abstract

Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite their expressiveness in behavioral cloning, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with the environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with a simple MLP structure explore the environment and be updated online via an RL algorithm with a reward model. This reward model is associated with a teacher FM model, which contains rich information about the expert data distribution. Furthermore, the same teacher FM model is used to regularize the student policy's behavior and thereby stabilize policy learning. Owing to the student's simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.
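To make the inference-cost argument concrete, the following is a minimal NumPy sketch of generic conditional flow matching with a straight-line path between noise and data. It is not taken from the paper: the function names `fm_loss` and `fm_sample`, the linear path, and the Euler integrator are all assumptions chosen for illustration, and the paper's teacher model, reward, and regularizer may be defined differently. The sketch only shows why drawing one sample from an FM model costs several network evaluations, whereas an MLP policy needs a single forward pass.

```python
import numpy as np


def fm_loss(velocity_fn, x1, rng):
    """Conditional flow-matching loss on a straight-line path (illustrative).

    x1: batch of data samples, shape (batch, dim).
    velocity_fn(x_t, t): hypothetical model returning a velocity of shape (batch, dim).
    """
    x0 = rng.standard_normal(x1.shape)        # noise endpoints
    t = rng.uniform(size=(x1.shape[0], 1))    # one time value per sample
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path
    target = x1 - x0                          # constant velocity of that path
    err = velocity_fn(x_t, t) - target
    return float(np.mean(np.sum(err ** 2, axis=1)))


def fm_sample(velocity_fn, x0, num_steps):
    """Euler-integrate dx/dt = v(x, t) from t = 0 to t = 1.

    Each sample costs num_steps calls to velocity_fn.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

As a usage example under the same assumptions: if every data point equals a constant `c`, the velocity field `(c - x) / (1 - t)` transports any starting noise to `c`, so `fm_sample` with that field returns `c` for every row, at the price of `num_steps` evaluations per sample.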

Paper Structure

This paper contains 51 sections, 59 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: The FM model serves dual roles: (1) training a reward model for downstream reinforcement learning of the student policy, and (2) generating state-action pairs to regularize the student policy. The student policy interacts with the environment to collect data, which—along with expert data—is used to train the teacher FM model.
  • Figure 2: Overview of the six evaluation environments. Navigation: (a) Ant-goal tasks a quadruped agent with reaching a target position; (e) Maze2d requires an agent to navigate a 2D maze to a goal location. Locomotion: (d) Hopper requires fast and stable forward locomotion without falling; (f) Walker2d requires fast and stable forward locomotion without falling. Manipulation: (b) Hand-rotate requires dexterous in-hand rotation of a cube to a target orientation; (c) Fetch-pick requires grasping a block and placing it at a desired goal.
  • Figure 3: Learning curves of FM-IRL and baselines across the six environments.
  • Figure 4: Learning curves of all methods in the Hand-rotate environment across six noise levels.
  • Figure 5: Comparison of four algorithms for updating FM policies online.
  • ...and 2 more figures