Decision Flow Policy Optimization

Jifeng Hu, Sili Huang, Siyuan Guo, Zhaogeng Liu, Li Shen, Lichao Sun, Hechang Chen, Yi Chang, Dacheng Tao

TL;DR

Decision Flow addresses the challenge of learning multi-modal action distributions in offline RL by unifying flow-based action modeling with policy optimization in a Flow MDP with discrete flow steps $\tau\in\{0,\dots,T\}$ and step size $\Delta t=1/T$. It introduces two implementations, Direction-Oriented (DF-dir) and Divergence-Oriented (DF-div), to guide intermediate flow decisions toward high-return actions and manage divergence from the behavior policy, with formal proofs of flow-critic convergence and policy improvement. The end-to-end framework integrates intermediate-flow value signals and flow matching objectives to achieve simultaneous distribution fitting and policy improvement. Empirically, DF matches or surpasses state-of-the-art baselines on dozens of offline RL tasks in D4RL, including Adroit, underscoring its potential for robust, multi-modal robotic control.

Abstract

In recent years, generative models have shown remarkable capabilities across diverse fields, including images, videos, language, and decision-making. By applying powerful generative models such as flow-based models to reinforcement learning, we can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces, overcoming the limitations of the single-modal action distributions of traditional Gaussian-based policies. Previous methods usually adopt generative models as behavior models to fit state-conditioned action distributions from datasets, with policy optimization conducted separately through additional policies using value-based sample weighting or gradient-based updates. However, this separation prevents the simultaneous optimization of multi-modal distribution fitting and policy improvement, ultimately hindering model training and degrading performance. To address this issue, we propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization. Specifically, our method formulates the action generation procedure of flow-based models as a flow decision-making process, where each action generation step corresponds to one flow decision. Consequently, our method seamlessly optimizes the flow policy while capturing multi-modal action distributions. We provide rigorous proofs for Decision Flow and validate its effectiveness through extensive experiments across dozens of offline RL environments. Compared with established offline RL baselines, the results demonstrate that our method matches or exceeds state-of-the-art (SOTA) performance.
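To make the "one generation step equals one flow decision" idea concrete, the sketch below rolls out a flow over the discrete steps $\tau\in\{0,\dots,T\}$ with step size $\Delta t=1/T$ and records each step as a transition. It is an illustration under stated assumptions, not the authors' implementation: the velocity field `toy_velocity`, the Gaussian initial sample, the Euler update, and the equal state and action dimensions are all hypothetical choices made for the example.

```python
import numpy as np


def toy_velocity(state, a_tau, t):
    """Hypothetical stand-in for a learned velocity field v(s, a^tau, t).

    It simply pushes the intermediate action toward tanh(state); a real
    flow policy would be a trained network. The time argument t is unused
    here and kept only to mirror the usual signature.
    """
    target = np.tanh(state)
    return target - a_tau


def rollout_flow(state, T=10, rng=None, velocity=toy_velocity):
    """Generate an action in T Euler steps, logging every flow decision.

    Assumption: a^0 is drawn from a standard Gaussian and each update is
    a^{tau+1} = a^tau + dt * v, with dt = 1 / T.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    dt = 1.0 / T
    a = rng.standard_normal(state.shape)  # intermediate action a^0
    transitions = []
    for tau in range(T):
        t = tau * dt
        v = velocity(state, a, t)  # the "flow decision" at step tau
        a_next = a + dt * v        # Euler step to the next flow state
        transitions.append((tau, a.copy(), v.copy(), a_next.copy()))
        a = a_next
    return a, transitions


if __name__ == "__main__":
    state = np.array([0.5, -1.0])
    action, transitions = rollout_flow(state, T=10)
    print("final action:", action)
    print("number of flow decisions:", len(transitions))
```

Each logged tuple `(tau, a_tau, v, a_next)` plays the role of a transition in the flow-level decision process, which is the kind of record a flow critic could be trained on. Note that $T+1$ flow states correspond to $T$ flow decisions.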

Paper Structure

This paper contains 27 sections, 14 theorems, 57 equations, 3 figures, 4 tables, 1 algorithm.

Key Result

Lemma 4.1

(Critic Consistency) Suppose $Q\rightarrow Q^*$, where $Q^*$ is the optimal conventional critic, the flow critics $Q^f$ and $V^f$ have sufficient model capacity, and the objectives $\mathcal{L}_{Q^f}$ and $\mathcal{L}_{V^f}$ are defined as [...]. Then we conclude that [...]

Figures (3)

  • Figure 1: Ablation study of Decision Flow. We investigate the importance of $a^{n-1}$, of formulating the flow as an RL problem, and of flow value functions (i.e., flow critics) on the Gym-MuJoCo tasks. Each vertex represents one task, and the coordinates represent the evaluation results on the corresponding tasks. The average score across all tasks is shown under the name in each sub-figure.
  • Figure 2: Parameter sensitivity of behavior tradeoff parameter $\rho$. We investigate the influence of $\rho$ on the Gym-MuJoCo tasks.
  • Figure 3: Parameter sensitivity of flow time step $T$. We investigate the influence of $T$ on the Gym-MuJoCo tasks.

Theorems & Definitions (24)

  • Definition 4.1
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Theorem 4.1
  • Theorem 4.2
  • proof
  • Lemma E.1
  • proof
  • Lemma E.2
  • ...and 14 more