Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

Yongjae Shin, Jongseong Chae, Jongeui Park, Youngchul Sung

TL;DR

This work tackles the challenge of sample-efficient offline-to-online reinforcement learning by leveraging flow-matching-based policies enhanced with injected noise. The proposed FINO method expands the learned action space during offline pre-training and uses entropy-guided sampling to balance exploration and exploitation during online fine-tuning, all while maintaining a stable, data-driven path through a continuous normalizing flow. The approach shows strong, consistent improvements across 45 tasks in OGBench and D4RL under limited online budgets, without sacrificing offline performance. By combining a theoretically grounded noise-injected flow objective with a practical entropy-guided sampling mechanism, FINO demonstrates how expressive generative policies can be effectively harnessed for efficient online adaptation in complex environments.

Abstract

Generative models have recently demonstrated remarkable success across diverse domains, motivating their adoption as expressive policies in reinforcement learning (RL). While they have shown strong performance in offline RL, particularly where the target distribution is well defined, their extension to online fine-tuning has largely been treated as a direct continuation of offline pre-training, leaving key challenges unaddressed. In this paper, we propose Flow Matching with Injected Noise for Offline-to-Online RL (FINO), a novel method that leverages flow matching-based policies to enhance sample efficiency for offline-to-online RL. FINO facilitates effective exploration by injecting noise into policy training, thereby encouraging a broader range of actions beyond those observed in the offline dataset. In addition to exploration-enhanced flow policy training, we combine an entropy-guided sampling mechanism to balance exploration and exploitation, allowing the policy to adapt its behavior throughout online fine-tuning. Experiments across diverse, challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets.

Paper Structure

This paper contains 30 sections, 6 theorems, 35 equations, 10 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

For notational simplicity, we denote $(s_i,x_i^1)$ as $x_i$. Given a dataset $\mathcal{D}=\{x_i\}_{i=1}^{N}$, the proposed time-dependent noise injection $\epsilon_t\sim\mathcal{N}(0,\alpha_t^2 I)$ induces the following conditional probability paths of flow $\phi_t$: $$p_t(x\mid x_i)=\mathcal{N}\big(x \mid t x_i,\ (1-(1-\eta)t)^2 I\big),$$ in which the mean $tx_i$ is equal to the mean induced from flow matching, and the variance $(1-(1-\eta)t)^2$ is greater than or equal to ...
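The mean and variance stated in Proposition 1 can be checked numerically with a small sketch. Everything beyond those two quantities is an assumption made for illustration and is not taken from the paper: the interpolant is assumed to be $x_t=(1-t)x^0+tx^1+\epsilon_t$ with $x^0\sim\mathcal{N}(0,I)$ independent of $\epsilon_t$, the plain flow-matching baseline is assumed to have variance $(1-t)^2$ (that is, $\sigma_{\min}=0$), and $\eta\in[0,1]$. Under these assumptions, matching $(1-t)^2+\alpha_t^2=(1-(1-\eta)t)^2$ gives $\alpha_t^2=2\eta t(1-t)+\eta^2t^2$, which is non-negative, so the noisy path is never narrower than the baseline. The function names `alpha`, `noisy_interpolant`, and `path_stats` are hypothetical.

```python
import numpy as np


def alpha(t, eta):
    """Hypothetical noise scale, chosen so that (1-t)^2 + alpha^2 = (1-(1-eta)t)^2.

    This schedule is an assumption derived from the stated variance,
    not a formula quoted from the paper.
    """
    return np.sqrt(2.0 * eta * t * (1.0 - t) + (eta * t) ** 2)


def noisy_interpolant(x1, t, eta, rng):
    """Sample x_t = (1-t) x0 + t x1 + eps_t (assumed form).

    x0 ~ N(0, I) is the source noise; eps_t ~ N(0, alpha_t^2 I) is the
    injected noise, drawn independently of x0.
    """
    x0 = rng.standard_normal(x1.shape)
    eps = alpha(t, eta) * rng.standard_normal(x1.shape)
    return (1.0 - t) * x0 + t * x1 + eps


def path_stats(x1_value, t, eta, n=200_000, seed=0):
    """Empirical mean and std of x_t for one fixed scalar data point x1."""
    rng = np.random.default_rng(seed)
    x1 = np.full(n, x1_value)
    xt = noisy_interpolant(x1, t, eta, rng)
    return xt.mean(), xt.std()


if __name__ == "__main__":
    eta, x1_value = 0.1, 0.7
    for t in (0.0, 0.5, 1.0):
        mean, std = path_stats(x1_value, t, eta)
        print(
            f"t={t:.1f}  mean={mean:+.3f} (target {t * x1_value:+.3f})  "
            f"std={std:.3f} (target {1 - (1 - eta) * t:.3f}, "
            f"baseline {1 - t:.3f})"
        )
```

In this sketch the empirical standard deviation should track $1-(1-\eta)t$, so at $t=1$ the path keeps a spread of $\eta$ around the data point instead of collapsing onto it, which is one way to read the claim that noise injection encourages actions beyond those in the offline dataset.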

Figures (10)

  • Figure 1: Comparison of FQL and FINO (ours) in terms of performance and exploration patterns on the environment antmaze-giant-navigate. The green circle and red star indicate the initial and goal states, respectively.
  • Figure 2: Toy example: blue contours represent the log-density of model samples; red circles denote the dataset.
  • Figure 3: Aggregate performance across two benchmark domains. Each figure reports the averaged learning curves over the common environments within the respective domain. Full results are presented in Figures \ref{fig:ogbench_full} and \ref{fig:d4rl_full}.
  • Figure 4: Comparison between FINO and the direct action noise injection baseline. Each plot shows results aggregated over five tasks and averaged across 10 seeds, with shaded regions indicating 95% confidence intervals.
  • Figure 5: Comparison between FINO and the entropy-regulated noise scaling baseline. Full results are presented in Table \ref{tab:full_result_ablation}.
  • ...and 5 more figures

Theorems & Definitions (9)

  • Proposition 1
  • Theorem 1
  • Theorem 2
  • Proposition 1
  • proof
  • Theorem 2
  • proof
  • Theorem 2
  • proof