Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

Prajwal Koirala, Cody Fleming

TL;DR

This work proposes the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation in offline reinforcement learning.

Abstract

Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making. The code is available at https://github.com/PrajwalKoirala/SSCP-Single-Step-Completion-Policy.
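The difference between a velocity vector and a completion vector can be illustrated with a small sketch. It assumes a straight-line interpolation path between a noise sample and a dataset action, and it assumes the completion vector is the displacement remaining from the intermediate sample to the target; the paper's exact parameterization and training loss may differ. All function names here are hypothetical and are not taken from the released code.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Flow-matching velocity target for the linear path: the constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def completion_target(x_t, x1):
    """Assumed completion target: remaining displacement from x_t to x1."""
    return [b - a for a, b in zip(x_t, x1)]

def one_step_complete(x_t, completion):
    """Jump from the intermediate sample to the target in a single step."""
    return [a + c for a, c in zip(x_t, completion)]

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(3)]  # noise sample
x1 = [0.5, -0.2, 0.9]                         # hypothetical dataset action
t = 0.3                                       # intermediate flow time

x_t = interpolate(x0, x1, t)
v = velocity_target(x0, x1)        # what a standard flow model regresses
c = completion_target(x_t, x1)     # what a completion head would regress (assumed)
recovered = one_step_complete(x_t, c)
```

Under these assumptions the completion target equals `(1 - t)` times the velocity target, so a network that predicts it accurately can reach the action in one step instead of integrating the velocity field over many steps.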

Paper Structure

This paper contains 66 sections, 40 equations, 15 figures, 17 tables, and 3 algorithms.

Figures (15)

  • Figure 1: Depiction of completion-based flow matching: while velocity vectors propagate along the generative path, completion vectors enable one-step shortcut jumps to the target distribution. This forms the basis of our Single-Step Completion Policy (SSCP), used in offline RL and related problems.
  • Figure 2: Hierarchy Distillation with Shortcuts
  • Figure 3: Training Curves for Offline RL with D4RL Datasets in Gym MuJoCo Locomotion Tasks
  • Figure 4: Training Curves for Offline-to-Online Fine-Tuning with D4RL Datasets in Gym MuJoCo Locomotion Tasks (100k offline and 100k online steps)
  • Figure 5: Training Curves for Offline-to-Online Fine-Tuning with D4RL Datasets in Gym MuJoCo Locomotion Tasks (250k offline and 250k online steps)
  • ...and 10 more figures