Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization

Ruijie Hao, Longfei Zhang, Yang Dai, Yang Ma, Xingxing Liang, Guangquan Cheng

Abstract

Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, most traditional RL algorithms parameterize the policy as a diagonal Gaussian distribution, which prevents it from capturing multimodal distributions and makes it difficult to cover the full range of optimal solutions in multi-solution problems; moreover, the return is reduced to a mean value, losing its multimodal nature and thus providing insufficient guidance for policy updates. To address these problems, we propose an RL algorithm termed flow-based policy with distributional RL (FP-DRL). The algorithm models the policy with flow matching, which offers both computational efficiency and the capacity to fit complex distributions. It additionally employs a distributional RL approach to model and optimize the entire return distribution, thereby guiding updates of the multimodal policy more effectively and improving agent performance. Experimental trials on MuJoCo benchmarks demonstrate that FP-DRL achieves state-of-the-art (SOTA) performance on most MuJoCo control tasks while exhibiting the superior representation capability of the flow policy.
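
The quantile-based distributional critic mentioned in the ablations can be made concrete with a small sketch. Below is a minimal PyTorch-style quantile-regression (Huber) loss over N return quantiles; the function name, tensor shapes, and the Huber threshold kappa are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """Quantile-regression Huber loss for an N-quantile distributional critic.

    pred_quantiles:   (batch, N) quantile estimates of the return Z(s, a)
    target_quantiles: (batch, N) Bellman targets, e.g. r + gamma * Z(s', a')
    """
    N = pred_quantiles.shape[1]
    # Midpoint quantile fractions tau_i = (2i + 1) / (2N)
    taus = (torch.arange(N, dtype=pred_quantiles.dtype,
                         device=pred_quantiles.device) + 0.5) / N

    # Pairwise TD errors u[b, i, j] = target_j - pred_i, shape (batch, N, N)
    u = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)

    # Huber loss on the errors, asymmetrically weighted by |tau_i - 1{u < 0}|
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    weight = (taus.view(1, N, 1) - (u.detach() < 0).float()).abs()

    # Sum over predicted quantiles, average over targets and the batch
    return (weight * huber / kappa).sum(dim=1).mean()
```

In an FP-DRL-style update, the targets would come from a distributional Bellman backup with the next action drawn from the flow policy; the exact target construction is not specified here.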

Paper Structure

This paper contains 25 sections, 1 theorem, 23 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

The distributional Bellman operator $\mathcal{T}^\pi$ for policy evaluation is a $\gamma$-contraction under the maximal $p$-Wasserstein metric $\bar{d}_p$: $\bar{d}_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \le \gamma\, \bar{d}_p(Z_1, Z_2)$.
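
For reference, the quantities in Lemma 1 follow the standard distributional RL construction of Bellemare et al. (2017); the notation below is that standard form and may differ slightly from the paper's. The maximal $p$-Wasserstein metric is $\bar{d}_p(Z_1, Z_2) = \sup_{s,a} W_p\big(Z_1(s,a), Z_2(s,a)\big)$, where $W_p$ is the $p$-Wasserstein distance between return distributions. The operator is defined by $\mathcal{T}^\pi Z(s,a) \overset{D}{=} R(s,a) + \gamma Z(S', A')$ with $S' \sim P(\cdot \mid s, a)$ and $A' \sim \pi(\cdot \mid S')$, so the contraction property reads $\bar{d}_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \le \gamma\, \bar{d}_p(Z_1, Z_2)$.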

Figures (4)

  • Figure 1: FM learns a velocity field that transports samples from a simple prior distribution to the target data distribution via an ordinary differential equation (ODE); a minimal sampling sketch is given after this list.
  • Figure 2: Benchmarks. (a) Humanoid-v4: $(s, a) \in \mathbb{R}^{376} \times \mathbb{R}^{17}$. (b) Ant-v4: $(s, a) \in \mathbb{R}^{111} \times \mathbb{R}^{8}$. (c) Hopper-v4: $(s, a) \in \mathbb{R}^{11} \times \mathbb{R}^{3}$. (d) HalfCheetah-v4: $(s, a) \in \mathbb{R}^{17} \times \mathbb{R}^{6}$. (e) InvertedPendulum-v4: $(s, a) \in \mathbb{R}^{4} \times \mathbb{R}^{1}$. (f) Reacher-v4: $(s, a) \in \mathbb{R}^{11} \times \mathbb{R}^{2}$.
  • Figure 3: Training curves on benchmarks. The solid lines correspond to the mean and the shaded regions to the standard error of the mean (SEM) over three runs.
  • Figure 4: Ablation studies of FP-DRL. (a) Performance comparison between a Gaussian policy and the proposed transformer-based flow policy. (b) Comparison of modeling returns using the mean versus a quantile-based distributional critic. (c) Effect of varying the number of quantiles (N=16, 32, and 64) for the distributional critic. (d) Training curves across different transformer sequence lengths (K=4, 7, 10, and 12).
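
Figure 1 describes flow matching as learning a velocity field that transports prior samples to the data distribution along an ODE. A minimal sketch of how an action could be drawn from such a flow policy by explicit Euler integration is given below; the network interface velocity_net(x, t, state), the step count, and the final tanh squashing are illustrative assumptions rather than the paper's transformer-based architecture.

```python
import torch

@torch.no_grad()
def sample_action(velocity_net, state, action_dim, num_steps=10):
    """Draw an action from a flow-matching policy by integrating the learned ODE.

    Assumes velocity_net(x, t, state) predicts the velocity v_theta(x, t | s) that
    transports prior samples x_0 ~ N(0, I) to actions x_1 along dx/dt = v_theta.
    """
    batch = state.shape[0]
    x = torch.randn(batch, action_dim, device=state.device)      # x_0 from the simple prior
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((batch, 1), k * dt, device=state.device)  # flow time in [0, 1)
        x = x + dt * velocity_net(x, t, state)                    # explicit Euler ODE step
    return torch.tanh(x)  # squash to bounded actions (an assumption, not from the paper)
```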

Theorems & Definitions (1)

  • Lemma 1: Bellemare et al., 2017