ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation

Songyuan Zhang, Oswin So, H. M. Sabbir Ahmad, Eric Yang Yu, Matthew Cleaveland, Mitchell Black, Chuchu Fan

TL;DR

Offline RL struggles with OOD actions and multimodal optimal policies. ReFORM tackles this by learning a bounded-source BC flow and a reflected flow noise generator, enforcing the on-support constraint by construction and avoiding hyperparameter-heavy regularization. The method yields state-of-the-art performance across 40 OGBench tasks with a single hyperparameter set, demonstrating robustness to dataset quality. By combining flow-based policy learning with noise manipulation inside a fixed support, ReFORM preserves multimodality while mitigating OOD errors, offering a practical approach for safe, scalable offline RL in real-world domains.

Abstract

Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed dataset generated by behavior policies without additional environment interactions. One common challenge that arises in this setting is the out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it is unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM learns a behavior cloning (BC) flow policy with a bounded source distribution to capture the support of the action distribution, then optimizes a reflected flow that generates bounded noise for the BC flow while keeping the support, to maximize the performance. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality and using a constant set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on the performance profile curves.
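The idea of keeping generated noise inside a bounded domain by reflection can be illustrated with a small sketch. The code below is not the paper's implementation: it assumes, purely for illustration, that the bounded domain is the box $[-l, l]^d$, uses plain Euler integration, and drives the flow with a hand-written `toy_velocity` field standing in for a learned network. The names `reflect_into_box`, `reflected_euler`, and `toy_velocity` are hypothetical.

```python
import random

def reflect_into_box(x, l):
    """Fold a scalar x back into [-l, l] by mirror reflection at the walls."""
    period = 4.0 * l
    y = (x + l) % period      # position on a triangle wave with period 4l
    if y > 2.0 * l:
        y = period - y        # descending half of the triangle wave
    return y - l

def reflected_euler(z0, velocity, l, n_steps=20):
    """Integrate dz/dt = velocity(z, t) on t in [0, 1], reflecting after each step."""
    z = list(z0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity(z, t)
        z = [reflect_into_box(zi + dt * vi, l) for zi, vi in zip(z, v)]
    return z

def toy_velocity(z, t):
    # Hypothetical field (an assumption, not a learned model) that pushes
    # samples outward, so unreflected integration would leave the box.
    return [3.0 * zi + 1.0 for zi in z]

rng = random.Random(0)
l = 1.0
samples = [[rng.uniform(-l, l) for _ in range(2)] for _ in range(1000)]
outputs = [reflected_euler(z, toy_velocity, l) for z in samples]

# Every coordinate of every output stays inside [-l, l].
all_inside = all(-l <= zi <= l for z in outputs for zi in z)
print(all_inside)
```

Because each step ends with a reflection, the output stays in the bounded domain regardless of the velocity field, which is the sense in which a constraint can hold by construction rather than through a penalty term.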

Paper Structure

This paper contains 52 sections, 4 theorems, 23 equations, 17 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Given a state $s\in{\mathcal{S}}$, for any $\epsilon$ such that $0\leq\epsilon<\infty$, $D_\mathrm{KL}(\pi_\theta(\cdot|s)\mid\mid\pi_\beta(\cdot|s))\leq\epsilon$ implies $\mathop{\mathrm{supp}}\nolimits(\pi_\theta(\cdot|s))\subseteq\mathop{\mathrm{supp}}\nolimits(\pi_\beta(\cdot|s))$. On the other
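The stated direction of Proposition 1 (a finite KL bound implies support containment) can be checked numerically in a simplified setting. The sketch below assumes a discrete action space with three actions and hand-picked probability vectors; the proposition itself is stated for general policies, and the names `kl_divergence`, `support`, `pi_beta`, `pi_on`, and `pi_off` are illustrative only.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # 0 * log(0 / q) = 0 by convention
        if qi == 0.0:
            return math.inf       # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

def support(p):
    """Indices of actions with strictly positive probability."""
    return {i for i, pi in enumerate(p) if pi > 0.0}

pi_beta = [0.5, 0.5, 0.0]   # behavior policy: never takes action 2
pi_on   = [0.9, 0.1, 0.0]   # stays on the behavior support
pi_off  = [0.8, 0.1, 0.1]   # puts mass on the unseen action 2

kl_on = kl_divergence(pi_on, pi_beta)
kl_off = kl_divergence(pi_off, pi_beta)

print(math.isfinite(kl_on), support(pi_on) <= support(pi_beta))
print(math.isfinite(kl_off), support(pi_off) <= support(pi_beta))
```

In this toy case, the policy with finite KL has its support contained in that of the behavior policy, while the policy that places mass on an unseen action has infinite KL and so cannot satisfy any finite bound $\epsilon$.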

Figures (17)

  • Figure 1: ReFORM algorithm. The process with gray arrows indicates the BC flow policy, learned to transform a simple source distribution $q_\mathrm{BC}=\mathcal{U}({\mathcal{B}}_l^d)$ to a target distribution $p_\mathrm{BC}$ that matches the dataset ${\mathcal{D}}$. The blue arrows indicate the ReFORM process, where we learn a flow noise generator to generate a manipulated source distribution $\tilde{q}_\mathrm{BC}$ for the BC policy so that the manipulated target $\tilde{p}_\mathrm{BC}$ maximizes the $Q$ value while staying inside the support (denoted in red) of the BC policy.
  • Figure 2: Performance profile over clean and noisy datasets. For a given normalized score $\tau$ (x-axis), the performance profile shows the probability that a given method achieves a score $\geq\tau$ (see Agarwal et al., 2021 for details). On the clean dataset, ReFORM achieves greater scores with higher probabilities than all other baselines. The same is true on the noisy dataset except for a small set of normalized scores around $0.9$ where ReFORM and FQL(S) have similar probabilities within the statistical margins.
  • Figure 3: Learned policy distributions with the toy example. The $Q$-value reaches the maximum at the lower left and upper right corners (see the $Q$-value plot in Figure 1). The red boundaries denote the estimated $\mathop{\mathrm{supp}}\nolimits(\pi_\mathrm{BC})$.
  • Figure 4: Ablations. Left: normalized scores of ReFORM and its variants with different source distributions. Right: training curves of ReFORM and its variants by changing its components.
  • Figure 5: Normalized scores with the clean dataset.
  • ...and 12 more figures
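
The performance profile described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration of the definition given there (the fraction of scores that are $\geq\tau$), not the paper's evaluation code; the function name `performance_profile` and the score values are made up for the example.

```python
def performance_profile(scores, taus):
    """For each threshold tau, return the fraction of scores that are >= tau."""
    n = len(scores)
    return [sum(1 for s in scores if s >= tau) / n for tau in taus]

# Hypothetical normalized scores of one method across four runs or tasks.
scores = [0.2, 0.5, 0.9, 1.0]
taus = [0.0, 0.5, 0.95]

profile = performance_profile(scores, taus)
print(profile)  # [1.0, 0.75, 0.25]
```

A method whose curve lies above another's at every $\tau$ reaches each score level at least as often, which is how the comparison in Figure 2 is read.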

Theorems & Definitions (9)

  • Proposition 1
  • Proposition 2
  • Remark 1
  • Theorem 1
  • Theorem 2
  • proof
  • proof
  • proof
  • proof