Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

Haochen Niu, Kanyu Zhang, Shuyu Yin, Qinghai Guo, Peilin Liu, Fei Wen

Abstract

In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state, there exists a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent vision-language-action (VLA) training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, leading to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of the FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvements in sample efficiency and success rate, in both in-distribution and out-of-distribution (OOD) scenarios. By aligning with the intrinsic action tolerance of physical manipulation, FAN-guided regularization provides a principled and practical method for sample-efficient and generalizable VLA adaptation.
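The Gaussian prior described in the abstract can be sketched as a regularization term for VLA models that discretize each action dimension into bins (as OpenVLA does). The following is a minimal illustration, not the paper's implementation: the function name `fan_regularizer`, the bin-space Gaussian parameterization, and the neighborhood width `sigma` are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fan_regularizer(logits, target_bin, sigma=1.0):
    """KL divergence between a Gaussian target over action bins
    (centered at the demonstrated bin) and the predicted bin distribution.

    logits:     (batch, num_bins) unnormalized scores for one action dimension
    target_bin: (batch,) index of the demonstrated action bin
    sigma:      assumed width of the feasible action neighborhood, in bins
    """
    num_bins = logits.shape[-1]
    bins = np.arange(num_bins)
    # Smooth, unimodal Gaussian mass around the demonstrated bin:
    # this encodes that nearby actions are near-equivalent.
    dist = bins[None, :] - np.asarray(target_bin)[:, None]
    target = softmax(-0.5 * (dist / sigma) ** 2, axis=-1)
    pred = softmax(logits, axis=-1)
    # KL(target || pred), averaged over the batch.
    kl = (target * (np.log(target + 1e-12) - np.log(pred + 1e-12))).sum(axis=-1)
    return kl.mean()
```

In training, such a term would be weighted and added to the usual objective, e.g. `loss = ce + lam * fan_regularizer(logits, target_bin)`, where `lam` is a hypothetical regularization weight.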

Paper Structure

This paper contains 33 sections, 1 theorem, 15 equations, 30 figures, and 18 tables.

Key Result

Proposition 1

For a given state $s$ and any action $a \in A$, the optimal policy $\pi_{t+1}$ that solves objective eq:fine_turning_with_RL is given by an exponential reweighting of the current policy $\pi_t$, where $\beta^* \geq 0$ is the optimal Lagrange multiplier for the trust-region constraint. $\blacktriangleleft$
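The closed form in Proposition 1 is the standard solution for KL-trust-region objectives. As a hedged reconstruction of the argument, assume the finetuning objective maximizes an expected advantage $\hat{A}(s,a)$ subject to a KL constraint to the current policy $\pi_t$ (the exact objective is the paper's eq:fine_turning_with_RL; the symbol $\hat{A}$ is an assumption here):

```latex
\max_{\pi}\ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[\hat{A}(s,a)\right]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\!\left(\pi(\cdot \mid s)\,\big\|\,\pi_t(\cdot \mid s)\right) \le \epsilon .
```

Forming the Lagrangian with multiplier $\beta \ge 0$ and setting its functional derivative with respect to $\pi(a \mid s)$ to zero (together with the normalization constraint) yields the exponentially reweighted form

```latex
\pi_{t+1}(a \mid s) \;=\; \frac{1}{Z(s)}\, \pi_t(a \mid s)\,
\exp\!\left(\frac{\hat{A}(s,a)}{\beta^*}\right),
```

with $Z(s)$ the normalizing constant and $\beta^*$ the optimal multiplier for the trust-region constraint.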

Figures (30)

  • Figure 1: Geometric structure of the policy distribution on a ManiSkill task. (a) After the SFT warm-up stage, the policy has learned a narrow, peaked distribution with a minimal FAN, resulting in poor generalization. (b) Subsequent RFT with PPO broadens the distribution, leading to improved task success. (c) Our FAN-PPO method explicitly guides the policy toward a robust Gaussian shape, achieving the highest success rate and demonstrating superior generalization.
  • Figure 2: SFT performance on OpenVLA with and without FAN-guided regularization across different OOD tasks on ManiSkill.
  • Figure 3: SFT performance on OpenVLA with and without FAN-guided regularization across different data sizes on in-distribution and three OOD tasks (Vision, Semantic, Execution).
  • Figure 4: Spatial robustness on LIBERO-Spatial, comparing OpenVLA finetuned with SFT (left) versus our FAN-SFT (right). Color indicates success rate; the black and red dashed lines are the equal-success-rate contours for each method, respectively.
  • Figure 5: Qualitative comparison under spatial perturbation. Vanilla SFT remains biased toward the seen position, whereas FAN-SFT correctly adapts to the perturbed target location.
  • ...and 25 more figures

Theorems & Definitions (3)

  • Definition 1: Feasible Action Neighborhood, FAN
  • Proposition 1: Form of the optimal policy
  • Proof of Proposition 1