VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

Changhua Xu, Jie Lu, Junyu Xuan, En Yu

TL;DR

The paper tackles the brittleness of few-shot Vision-Language-Action (VLA) adaptation under scarce demonstrations, where near-miss geometric errors cause failures despite semantically plausible actions. It reframes adaptation as a generate-then-select problem, proposing VGAS, which uses a high-recall base VLA to propose action chunks and a geometry-aware Transformer critic (Q-Chunk-Former) to perform Best-of-$N$ selection. VGAS introduces Explicit Geometric Regularization (EGR) to prevent value landscape collapse in offline RL, and proves convergence properties of the chunked Expected-Max backup operator, while ensuring temporal consistency and fine-grained spatial ranking. Experiments on LIBERO show that VGAS substantially improves success rates and robustness over SFT and standard offline-RL baselines, with EGR providing the largest gains and the Transformer critic enabling precise geometry-grounded valuations. The approach offers a practical path for robust, data-efficient VLA adaptation, albeit with increased inference latency and evaluation limited to simulation so far.

Abstract

Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-$N$ selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a finetuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
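
The generate-then-select step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `select_best_of_n`, `toy_propose`, and `toy_critic` are hypothetical names, the proposal is a noisy straight-line sampler standing in for the fine-tuned VLA, and the critic is a hand-written endpoint-distance score standing in for the learned Q-Chunk-Former. Only the control flow (sample $N$ chunks, score each, execute the argmax) reflects the description above.

```python
import random

def select_best_of_n(obs, propose, critic, n, rng):
    """Sample n action chunks from the proposal and return the critic's argmax."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [propose(obs, rng) for _ in range(n)]
    values = [critic(obs, chunk) for chunk in candidates]
    best = max(range(n), key=lambda i: values[i])
    return candidates[best], values[best], values

def toy_propose(target, rng, horizon=4, noise=0.05):
    """Hypothetical proposal: a chunk of `horizon` 2-D displacements that
    head toward `target`, each perturbed by Gaussian noise (near-miss errors)."""
    step_x, step_y = target[0] / horizon, target[1] / horizon
    return [(step_x + rng.gauss(0.0, noise), step_y + rng.gauss(0.0, noise))
            for _ in range(horizon)]

def toy_critic(target, chunk):
    """Hypothetical critic: negative distance between the integrated chunk
    endpoint and the target, so geometrically closer chunks score higher."""
    end_x = sum(action[0] for action in chunk)
    end_y = sum(action[1] for action in chunk)
    return -(((end_x - target[0]) ** 2 + (end_y - target[1]) ** 2) ** 0.5)

rng = random.Random(0)
target = (0.30, -0.10)  # toy "observation": where the end effector should land
chunk, value, values = select_best_of_n(target, toy_propose, toy_critic, n=16, rng=rng)
print(f"selected value {value:.4f}, worst candidate {min(values):.4f}")
```

In this toy setting every candidate points roughly at the target, so the proposals are all plausible and differ only in small geometric offsets; the selection step is what separates them. The cost is $N$ proposal samples and $N$ critic evaluations per decision, which is consistent with the added inference latency noted in the TL;DR.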

Paper Structure

This paper contains 49 sections, 9 theorems, 46 equations, 6 figures, 4 tables.

Key Result

Proposition 1

In the tabular chunk-induced SMDP, assume bounded rewards and $\gamma^h\in(0,1)$. Then the chunked Expected-Max operator $\mathcal{T}_\mu^{N}$ is a $\gamma^h$-contraction under $\|\cdot\|_\infty$ and has a unique fixed point $Q_\mu^{N}$. Let $\pi_\mu^{(N)} := \pi_{\mu,Q_\mu^{N}}^{(N)}$ be the induced ...
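
The contraction claim can be checked numerically on a small random instance. The operator's definition is not reproduced on this page, so the sketch below assumes the form $(\mathcal{T}_\mu^{N}Q)(s,a) = R(s,a) + \gamma^h \sum_{s'} P(s'\mid s,a)\,\mathbb{E}_{a'_1,\dots,a'_N\sim\mu(\cdot\mid s')}\big[\max_i Q(s',a'_i)\big]$, where each tabular "action" is a whole chunk. That form, the random SMDP, and all function names are assumptions made for illustration. The expected maximum of $N$ i.i.d. draws is computed exactly from the CDF of the sorted values rather than by sampling.

```python
import random

def expected_max(q_values, probs, n):
    """Exact E[max of n i.i.d. draws a ~ probs of q_values[a]]:
    P(max <= q_k) = F_k ** n, with F the CDF over actions sorted by value."""
    order = sorted(range(len(q_values)), key=lambda a: q_values[a])
    total, cdf_prev = 0.0, 0.0
    for a in order:
        cdf = cdf_prev + probs[a]
        total += q_values[a] * (cdf ** n - cdf_prev ** n)
        cdf_prev = cdf
    return total

def backup(Q, R, P, mu, discount, n):
    """One application of the assumed chunked Expected-Max operator."""
    num_states, num_actions = len(R), len(R[0])
    v = [expected_max(Q[s], mu[s], n) for s in range(num_states)]
    return [[R[s][a] + discount * sum(P[s][a][t] * v[t] for t in range(num_states))
             for a in range(num_actions)] for s in range(num_states)]

def sup_dist(Q1, Q2):
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))

def random_simplex(k, rng):
    w = [rng.random() + 1e-3 for _ in range(k)]
    z = sum(w)
    return [x / z for x in w]

def fixed_point(R, P, mu, discount, n, iters=500):
    """Iterate the backup from Q = 0; the error shrinks like discount ** iters."""
    Q = [[0.0] * len(R[0]) for _ in R]
    for _ in range(iters):
        Q = backup(Q, R, P, mu, discount, n)
    return Q

rng = random.Random(0)
S, A, gamma, h, N = 5, 4, 0.99, 8, 4
discount = gamma ** h  # one chunk spans h primitive steps
R = [[rng.random() for _ in range(A)] for _ in range(S)]              # bounded rewards
P = [[random_simplex(S, rng) for _ in range(A)] for _ in range(S)]    # P[s][a][s']
mu = [random_simplex(A, rng) for _ in range(S)]                       # proposal mu(a|s)

# Contraction: ||T Q1 - T Q2||_inf <= gamma^h * ||Q1 - Q2||_inf for arbitrary Q1, Q2.
Q1 = [[rng.uniform(-5, 5) for _ in range(A)] for _ in range(S)]
Q2 = [[rng.uniform(-5, 5) for _ in range(A)] for _ in range(S)]
lhs = sup_dist(backup(Q1, R, P, mu, discount, N), backup(Q2, R, P, mu, discount, N))
contraction_ok = lhs <= discount * sup_dist(Q1, Q2) + 1e-12

# Fixed point: repeated backups converge, so the Bellman residual vanishes.
Q_star = fixed_point(R, P, mu, discount, N)
residual = sup_dist(backup(Q_star, R, P, mu, discount, N), Q_star)

# On this instance, a larger N never lowers the fixed-point values.
Q_n1 = fixed_point(R, P, mu, discount, 1)
Q_n8 = fixed_point(R, P, mu, discount, 8)
monotone = all(b >= a - 1e-9 for r1, r8 in zip(Q_n1, Q_n8) for a, b in zip(r1, r8))
print(contraction_ok, residual, monotone)
```

The contraction follows from the same two facts the sketch relies on: the expected max is non-expansive in the sup norm (its weights $F_k^N - F_{k-1}^N$ are non-negative and sum to one), and the backup multiplies that term by $\gamma^h < 1$.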

Figures (6)

  • Figure 1: Illustration of the near-miss action distribution under 5-shot VLA fine-tuning.
  • Figure 2: The overall framework of VGAS. Generation: A fine-tuned VLA policy proposes $N$ candidate action chunks from multimodal inputs. Selection: Q-Chunk-Former learns a scoring function $Q$ via the EGR+TD objective. Best-of-$N$ selection defines the induced policy $\pi_{\mu,Q}^{(N)}$ by maximizing over a discriminative value landscape shaped by EGR, prioritizing expert-aligned candidates and thereby mitigating geometric drift.
  • Figure 3: Visualization of the Proposal-Candidate Value Landscape: CQL vs. EGR (Ours)
  • Figure 4: Multi-view Spatial Rollouts of Action Chunks and VGAS Selection. Trajectories are reconstructed via temporal integration in orthogonal views. Blue: SFT proposals; Orange: VGAS selection; Red: Ground Truth. VGAS identifies the trajectory aligning with the expert across 3D space.
  • Figure 5: Offline Ranking Evaluation on Held-out Data.
  • ...and 1 more figure

Theorems & Definitions (17)

  • Proposition 1: Chunked Expected-Max in tabular SMDPs
  • Proposition 2: Best-of-$N$ bound under an EGR anchoring envelope
  • Definition 1: Chunked Expected-Max Operator
  • Lemma 1: Well-definedness
  • proof
  • Lemma 2: Non-expansiveness of expected max
  • proof
  • Theorem 1: $\gamma^h$-Contraction
  • proof
  • Proposition 3: Monotonicity in $N$
  • ...and 7 more