Flow Actor-Critic for Offline Reinforcement Learning

Jongseong Chae, Jongeui Park, Yongjae Shin, Gyeongmin Kim, Seungyul Han, Youngchul Sung

TL;DR

This paper tackles offline RL on datasets with complex, multi-modal action distributions by introducing Flow Actor-Critic (FAC), which uses a flow-based policy together with a density-aware behavior proxy for both actor regularization and critic penalization. The method defines a density-guided penalized Bellman operator that leaves Q-values unbiased in high-confidence, in-distribution (ID) regions and makes them conservative in low-confidence, out-of-distribution (OOD) regions, while an enhanced one-step flow actor enables expressive but constrained policy optimization. FAC achieves state-of-the-art performance on challenging benchmarks such as D4RL and OGBench, demonstrating robust handling of OOD actions and improved scalability to high-dimensional action spaces. Two-stage training and density-based thresholding make the approach tractable in practice for offline policy improvement with flow-based models.

Abstract

Datasets in offline reinforcement learning (RL) often exhibit complex, multi-modal distributions, necessitating policies more expressive than the widely used Gaussian policies. To handle such datasets, we propose Flow Actor-Critic, a new actor-critic method for offline RL based on recent flow policies. The proposed method not only uses the flow model for the actor, as in previous flow policies, but also exploits the expressive flow model to obtain a conservative critic that prevents Q-value explosion in out-of-data regions. To this end, we propose a new form of critic regularizer based on the flow behavior proxy model obtained as a byproduct of the flow-based actor design. Leveraging the flow model in this joint way, we achieve new state-of-the-art performance on offline RL test datasets, including the D4RL and recent OGBench benchmarks.

Paper Structure

This paper contains 40 sections, 5 theorems, 47 equations, 8 figures, 8 tables, and 2 algorithms.

Key Result

Proposition 1

Let $\beta$ be the underlying behavior policy, $\hat{\beta}$ be our proxy for $\beta$, $\pi$ be the learned actor, and $Q$ be the value function of $\pi$. Consider the original Bellman operator $\mathcal{T}^{\pi}Q(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a),\,a'\sim\pi(\cdot|s')}\left[Q(s',a')\right]$ [...] unless $\beta(a|s)=0$ and $w^{\hat{\beta}}(s,a)=0$ simultaneously.
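
To make the idea of a density-guided penalty concrete, here is a minimal sketch. It is not the paper's operator, whose exact definition is not reproduced in this overview. The weight function `weight`, its linear shape, the threshold `eps`, and the coefficient `alpha` are all illustrative assumptions: the weight is taken to be 0 wherever the proxy density reaches `eps` (so the standard target is left untouched) and to grow toward 1 as the density falls to 0 (so the target is pushed down).

```python
def weight(density: float, eps: float) -> float:
    """Hypothetical weight, assumed for illustration only.

    Returns 0 when the proxy density is at least eps (treated as in-distribution)
    and rises linearly to 1 as the density approaches 0 (treated as OOD).
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return max(0.0, 1.0 - density / eps)


def penalized_target(reward: float, next_q: float, density: float,
                     gamma: float = 0.99, alpha: float = 1.0,
                     eps: float = 0.1) -> float:
    """One-sample Bellman target minus an assumed density-based penalty.

    reward + gamma * next_q is the standard target; alpha * weight(...) is the
    illustrative penalty, which vanishes in high-density regions.
    """
    standard = reward + gamma * next_q
    return standard - alpha * weight(density, eps)
```

Under these assumptions, a call such as `penalized_target(1.0, 2.0, density=0.5)` returns the unpenalized target, while lowering `density` below `eps` subtracts up to `alpha` from it.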

Figures (8)

  • Figure 1: BC models on a synthetic four-component Gaussian mixture dataset: Top row - samples from each BC model. Bottom row - log-density or ELBO plot.
  • Figure 2: Weight function $w^{\hat{\beta}}$.
  • Figure 3: Empirical analysis of components. (a) Effect of flow-based critic penalization (CP) on performance under $N$ candidate actions. (b) Performance across actor regularization coefficient $\lambda$ and critic penalization coefficient $\alpha$. (c) Influence of the fidelity of the flow behavior proxy under flow steps $T$. (d) Performance of weight function designs. (e) Performance under different $\epsilon$ schemes.
  • Figure 4: Offline-to-online results, averaged over 8 seeds with standard deviation. The gray shaded region indicates the offline phase, and the online fine-tuning phase is left unshaded. Full plots on 15 tasks are in Fig. \ref{fig-experiment-3:O2O_full}.
  • Figure 5: Flow behavior proxy model with different step counts of the Euler method.
  • ...and 3 more figures
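
The Figure 5 caption refers to varying the step count of the Euler method. As a rough illustration of why the step count matters, the sketch below integrates a velocity field from $t=0$ to $t=1$ with a fixed number of Euler steps. The field `toy_velocity` is a hypothetical stand-in for a learned velocity network, chosen only because its exact solution is known. It is not the paper's model.

```python
import math
from typing import Callable


def euler_sample(velocity: Callable[[float, float], float],
                 x0: float, num_steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # time at the start of the current step
        x = x + dt * velocity(x, t)
    return x


def toy_velocity(x: float, t: float) -> float:
    """Hypothetical velocity field; the exact flow maps x0 to x0 * exp(-2)."""
    return -2.0 * x


# Absolute error of the Euler endpoint against the exact solution for x0 = 1.
exact = math.exp(-2.0)
errors = {T: abs(euler_sample(toy_velocity, 1.0, T) - exact)
          for T in (1, 4, 16, 64)}
```

For this toy field the error shrinks as the step count grows, which is the trade-off between integration cost and fidelity that a step-count comparison examines.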

Theorems & Definitions (10)

  • Proposition 1
  • Proposition 1
  • proof
  • Definition 1
  • Proposition 2
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof