VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

Changhua Xu; Jie Lu; Junyu Xuan; En Yu

TL;DR

The paper tackles the brittleness of few-shot Vision-Language-Action (VLA) adaptation under scarce demonstrations, where near-miss geometric errors cause failures despite semantically plausible actions. It reframes adaptation as a generate-then-select problem, proposing VGAS, which uses a high-recall base VLA to propose action chunks and a geometry-aware Transformer critic (Q-Chunk-Former) to perform Best-of-$N$ selection. VGAS introduces Explicit Geometric Regularization (EGR) to prevent value landscape collapse in offline RL, and proves convergence properties of the chunked Expected-Max backup operator, while ensuring temporal consistency and fine-grained spatial ranking. Experiments on LIBERO show that VGAS substantially improves success rates and robustness over SFT and standard offline-RL baselines, with EGR providing the largest gains and the Transformer critic enabling precise geometry-grounded valuations. The approach offers a practical path for robust, data-efficient VLA adaptation, albeit with increased inference latency and evaluation limited to simulation so far.

Abstract

Vision--Language--Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. Fine-tuned VLA policies often produce semantically plausible trajectories, yet failures frequently arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a \emph{generation--selection} perspective and propose a novel framework, \textbf{VGAS} (\textbf{V}alue-\textbf{G}uided \textbf{A}ction-chunk \textbf{S}election). It performs inference-time best-of-$N$ selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, \textbf{VGAS} employs a fine-tuned VLA as a high-recall proposal generator and introduces the \textrm{Q-Chunk-Former}, a geometrically grounded Transformer critic that resolves fine-grained geometric ambiguities. In addition, we propose \textit{Explicit Geometric Regularization} (\texttt{EGR}), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that \textbf{VGAS} consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
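
The generate-then-select step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `best_of_n_select`, `toy_proposal`, and `toy_critic` are hypothetical names, the chunk length of 8 and action dimension of 7 are assumed values, and the toy critic (negative distance to a known expert chunk) merely stands in for a learned scoring function such as the Q-Chunk-Former.

```python
import numpy as np

def best_of_n_select(sample_chunk, q_value, obs, n, rng):
    """Sample n candidate action chunks and keep the highest-scoring one."""
    candidates = [sample_chunk(obs, rng) for _ in range(n)]
    scores = np.array([q_value(obs, c) for c in candidates])
    best = int(np.argmax(scores))  # index of the top-ranked candidate
    return candidates[best], scores

# --- Toy stand-ins (hypothetical, for illustration only) ---
H, ACTION_DIM = 8, 7                       # assumed chunk length and action size
expert_chunk = np.zeros((H, ACTION_DIM))   # pretend expert action chunk

def toy_proposal(obs, rng):
    # Stand-in for the fine-tuned VLA: near-miss samples around the expert chunk.
    return expert_chunk + 0.05 * rng.standard_normal((H, ACTION_DIM))

def toy_critic(obs, chunk):
    # Stand-in for a learned critic: higher score means closer to the expert.
    return -float(np.linalg.norm(chunk - expert_chunk))

rng = np.random.default_rng(0)
chosen, scores = best_of_n_select(toy_proposal, toy_critic, obs=None, n=16, rng=rng)
```

In this sketch the proposal generator only needs high recall (at least one good candidate among the $N$ samples), while the critic carries the burden of ranking near-miss candidates correctly.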

Paper Structure (49 sections, 9 theorems, 46 equations, 6 figures, 4 tables)

Introduction
Preliminary
Offline Reinforcement Learning
Action Chunking in VLAs
Few-shot VLA adaptation objective.
Methodology
Framework Overview.
Q-Chunk-Former
Optimization Objective
3.2.1 Temporal Consistency
3.2.2 Spatial Consistency
3.2.3 The Closed Loop of Spatio-Temporal Consistency
Experiment
Experiment Settings
Benchmark and Architecture.
...and 34 more sections

Key Result

Proposition 1

In the tabular chunk-induced SMDP, assume bounded rewards and $\gamma^h\in(0,1)$. Then $\mathcal{T}_\mu^{N}$ in Eq.~\eqref{eq:chunked_emax_operator} is a $\gamma^h$-contraction under $\|\cdot\|_\infty$ and has a unique fixed point $Q_\mu^{N}$. Let $\pi_\mu^{(N)} := \pi_{\mu,Q_\mu^{N}}^{(N)}$ be the induced

Figures (6)

Figure 1: Illustration of the near-miss action distribution under 5-shot VLA fine-tuning.
Figure 2: The overall framework of VGAS. Generation: A fine-tuned VLA policy proposes $N$ candidate action chunks from multimodal inputs. Selection: Q-Chunk-Former learns a scoring function $Q$ via the EGR+TD objective. Best-of-$N$ selection defines the induced policy $\pi_{\mu,Q}^{(N)}$ by maximizing over a discriminative value landscape shaped by EGR, prioritizing expert-aligned candidates and thereby mitigating geometric drift.
Figure 3: Visualization of the Proposal-Candidate Value Landscape: CQL vs. EGR (Ours)
Figure 4: Multi-view Spatial Rollouts of Action Chunks and VGAS Selection. Trajectories are reconstructed via temporal integration in orthogonal views. Blue: SFT proposals; Orange: VGAS selection; Red: Ground Truth. VGAS identifies the trajectory aligning with the expert across 3D space.
Figure 5: Offline Ranking Evaluation on Held-out Data.
...and 1 more figure

Theorems & Definitions (17)

Proposition 1: Chunked Expected--Max in tabular SMDPs
Proposition 2: Best-of-$N$ bound under an EGR anchoring envelope
Definition 1: Chunked Expected-Max Operator
Lemma 1: Well-definedness
proof
Lemma 2: Non-expansiveness of expected max
proof
Theorem 1: $\gamma^h$-Contraction
proof
Proposition 3: Monotonicity in $N$
...and 7 more
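
The contraction result listed above (Theorem 1, via Lemma 2) follows a standard argument, sketched here under stated assumptions. The excerpt references the chunked Expected-Max operator but does not reproduce its equation, so the operator form below, the $h$-step chunk reward $r_h$, and the chunk-level transition kernel $P_h$ are assumptions rather than the paper's exact definitions.

```latex
% Assumed form of the chunked Expected-Max operator (not given in this excerpt):
(\mathcal{T}_\mu^{N} Q)(s,a)
  = r_h(s,a) + \gamma^h \,
    \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)} \,
    \mathbb{E}_{a'_1,\dots,a'_N \sim \mu(\cdot \mid s')}
    \Big[ \max_{1 \le i \le N} Q(s', a'_i) \Big]

% Non-expansiveness of the max (the role played by Lemma 2):
\Big| \max_{i} x_i - \max_{i} y_i \Big| \le \max_{i} |x_i - y_i|

% The reward terms cancel, and the expectation preserves the bound:
\big| (\mathcal{T}_\mu^{N} Q_1)(s,a) - (\mathcal{T}_\mu^{N} Q_2)(s,a) \big|
  \le \gamma^h \, \mathbb{E}\Big[ \max_{i} \big| Q_1(s',a'_i) - Q_2(s',a'_i) \big| \Big]
  \le \gamma^h \, \| Q_1 - Q_2 \|_\infty
```

Taking the supremum over $(s,a)$ gives the $\gamma^h$-contraction in $\|\cdot\|_\infty$, and the Banach fixed-point theorem then yields the unique fixed point stated in Proposition 1.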
