$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

Siting Wang; Xiaofeng Wang; Zheng Zhu; Minnan Pei; Xinyu Cui; Cheng Deng; Jian Zhao; Guan Huang; Haifeng Zhang; Jun Wang

TL;DR

This work proposes $π$-StepNFT, a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. It achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features.

Abstract

Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, which hinders online reinforcement learning. We propose $π$-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $π$-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property makes it a scalable and promising solution for complex real-world applications.

Paper Structure (47 sections, 8 theorems, 47 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Online RL for VLAs
Policy Optimization for Generative Models
Preliminaries
Flow Matching for VLA Models
RL Fine-tuning and the Likelihood Gap
Method
Step-wise Transitions and Mirror Errors
$\pi$-StepNFT: Step-wise Contrastive Objective
Validity and Optimized Direction
Oracle direction from posterior splits (not directly computable).
Computable surrogate via mirrored transitions (what we actually optimize).
Comparison with Diffusion-NFT (Weighted-MSE)
Experiments
...and 32 more sections

Key Result

Lemma 4.2

Under the shared covariance $\Sigma_t$, the difference in squared errors is proportional to the log-likelihood ratio of the two mirrored branches:
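The Gaussian identity behind this statement can be checked numerically. The sketch below is a minimal illustration, not the paper's implementation: it assumes a diagonal shared covariance, and the names `mean_plus`, `mean_minus`, and `shared_var`, along with all numeric values, are hypothetical. When two Gaussians share a covariance, their normalizing constants cancel, so the log-likelihood ratio equals half the difference of the covariance-weighted squared errors.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal Gaussian N(mean, diag(var)) evaluated at x."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xi - mi) ** 2 / v
        for xi, mi, v in zip(x, mean, var)
    )


def weighted_sq_error(x, mean, var):
    """Squared error between x and mean, weighted by the inverse diagonal covariance."""
    return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, var))


# Hypothetical values, chosen only for illustration (not from the paper).
x = [0.3, -1.2, 0.8]              # observed next state
mean_plus = [0.5, -1.0, 0.6]      # mean of one branch
mean_minus = [-0.5, 1.0, -0.6]    # mean of the other branch
shared_var = [0.2, 0.5, 1.5]      # diagonal of the shared covariance

# Log-likelihood ratio of the two branches at x.
log_ratio = gaussian_logpdf(x, mean_plus, shared_var) - gaussian_logpdf(
    x, mean_minus, shared_var
)

# Difference in covariance-weighted squared errors (other branch minus this one).
error_gap = weighted_sq_error(x, mean_minus, shared_var) - weighted_sq_error(
    x, mean_plus, shared_var
)

# With a shared covariance, the log-determinant terms cancel exactly.
print(abs(log_ratio - 0.5 * error_gap) < 1e-9)
```

The proportionality constant of one half is specific to this covariance-weighted form of the squared error; with an unweighted error and isotropic covariance, the constant would instead absorb the inverse variance.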

Figures (5)

Figure 1: Comparison of training paradigms. Left (ODE): Terminal supervision is well-posed for deterministic ODEs but results in a narrow expert manifold. Middle (Naive SDE): Stochastic rollouts introduce a wider exploration space, but coarse terminal supervision fails to correct deviations, leading to misalignment. Right ($\pi$-StepNFT): Our method leverages the wider space from SDE but applies finer, step-wise ranking guidance to ensure robust alignment with the expert manifold.
Figure 2: Flow-SDE sampling and step-wise supervision improve on-policy stability.
Figure 3: Contrastive ranking enables stable critic-free learning.
Figure 4: Hyperparameter sensitivity analysis. The configuration selected for the main experiments is highlighted by the bold pink curves.
Figure 5: Step selection ablation. Performance comparison between uniform random solver-step sampling and fixed-step selection strategies.

Theorems & Definitions (9)

Definition 4.1: $\pi$-StepNFT Objective
Lemma 4.2: Log-Likelihood Ratio
Proposition 4.3: Bayes Monotonicity
Theorem 4.4: Gradient Form and Small-Step Alignment
Theorem 4.5: Separation Penalty in wMSE
Lemma 1.1: Distribution Split (Diffusion-NFT)
Lemma 1.2: Posterior Split (Diffusion-NFT)
Corollary 1.3: Posterior Expectation Split
Lemma 1.4: Oracle Velocity/Mean Splits (for alignment)
