SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi

TL;DR

This work targets robustness in Vision-Language-Action (VLA) robotic systems by addressing perceptual ambiguity and action uncertainty without additional training or verifiers. It introduces SCALE, a training-free, single-pass inference strategy that jointly modulates what the model perceives and what it does, guided by a self-uncertainty score computed from output logits. The core idea uses two references, a low-uncertainty one-hot and a high-uncertainty uniform distribution, to define $u^k_t = D_{KL}(p^k_t\|q^{low}) - D_{KL}(p^k_t\|q^{high})$, which governs both adaptive action decoding via $\tau^k_t = T_{0}\cdot\sigma(u^k_t)$ and adaptive visual attention via $\gamma_t = \kappa^{\tanh(\Delta u_{t-1})}$; these updates occur within a single forward pass. Empirically, SCALE improves state-of-the-art VLAs across multiple backbones (OpenVLA, $\pi_0$-FAST, SpatialVLA) and benchmarks (LIBERO, SIMPLER-WidowX, LIBERO-PRO-Long), outperforming training-based Test-Time Scaling methods while maintaining real-time efficiency. By grounding adaptive perception and action in Self-Uncertainty and Active Inference, SCALE enables robust, real-time robotic control under perceptual ambiguity and environmental variability.
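
The self-uncertainty score and the decoding temperature described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. A literal one-hot $q^{low}$ makes $D_{KL}(p^k_t\|q^{low})$ infinite whenever $p^k_t$ places any mass off the argmax, so the sketch assumes a smoothed one-hot with a hypothetical parameter `eps`. The function names, `eps`, and the default `T0` are illustrative and do not come from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # D_KL(p || q), treating 0 * log 0 as 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def self_uncertainty(logits, eps=1e-3):
    # u = D_KL(p || q_low) - D_KL(p || q_high)
    p = softmax(logits)
    vocab = p.size
    q_high = np.full(vocab, 1.0 / vocab)      # uniform reference (full ambiguity)
    # ASSUMPTION: one-hot at the argmax, smoothed by eps so the KL stays finite.
    q_low = np.full(vocab, eps / (vocab - 1))
    q_low[np.argmax(p)] = 1.0 - eps
    return kl(p, q_low) - kl(p, q_high)

def decoding_temperature(u, T0=1.0):
    # tau = T0 * sigmoid(u): near-greedy when u is low, close to T0 when u is high.
    return T0 / (1.0 + np.exp(-u))

u_confident = self_uncertainty(np.array([10.0, 0.0, 0.0, 0.0]))
u_ambiguous = self_uncertainty(np.zeros(4))
print(u_confident, decoding_temperature(u_confident))
print(u_ambiguous, decoding_temperature(u_ambiguous))
```

With these assumptions, a sharply peaked distribution yields a negative score and a small temperature, while a flat distribution yields a positive score and a temperature close to `T0`.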

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, and test-time scaling (TTS) is gaining attention as a way to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory. It requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty and focuses on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

Paper Structure

This paper contains 36 sections, 14 equations, 8 figures, 12 tables, 1 algorithm.

Figures (8)

  • Figure 1: Motivation of SCALE. (a) Existing VLA inference relies on a fixed pipeline, where visual attention may miss task-relevant cues (left; red and green boxes) and greedy decoding commits to a single action despite plausible alternatives (right). (b) SCALE addresses these limitations by jointly modulating visual perception and action based on self-uncertainty: under low uncertainty, it sharpens attention and performs near-greedy execution; under high uncertainty, it broadens attention and enables explorative sampling.
  • Figure 2: Overview of SCALE. (a) Adaptive Visual Attention modulates the vision encoder's attention temperature $\gamma_t$ based on the deviation of uncertainty from its recent history, sharpening focus when confident ($\gamma_t < 1$) and broadening exploration when uncertain ($\gamma_t > 1$). (b) Self-Uncertainty Estimation quantifies self-uncertainty $u^k_t$ by measuring where the predicted distribution $p^k_t$ lies relative to two references: a one-hot $q^{\mathrm{low}}$ (full certainty) and a uniform $q^{\mathrm{high}}$ (full ambiguity). (c) Adaptive Action Decoding scales the sampling temperature $\tau^k_t$ based on token-level uncertainty $u^k_t$, enabling near-greedy execution under confidence and diverse sampling under ambiguity. (d) Visual Attention Temperature Update compares the current step-level uncertainty $u_t$ against its recent history (EMA, $\bar{u}_{t-1}$) to obtain the deviation $\Delta u_t \coloneqq u_t - \bar{u}_{t-1}$, then converts it into the attention temperature $\gamma_{t+1}$: when $u_t$ exceeds the EMA ($\Delta u_t > 0$), $\gamma_{t+1} > 1$ broadens attention (explore); when it falls below ($\Delta u_t < 0$), $\gamma_{t+1} < 1$ sharpens attention (focus).
  • Figure 3: Qualitative result of adaptive visual attention. We visualize attention from SigLIP, the vision encoder $f_{\phi}$ of OpenVLA, at $t{=}45$ when self-uncertainty suddenly increases; color indicates attention intensity (blue: low, green: medium, red: high).
  • Figure 4: Qualitative results of adaptive action decoding. We compare greedy decoding (top) and SCALE (middle) on the real-world task using $\pi_0$-FAST; blue arrows indicate robot motion.
  • Figure 5: Task success rate by average $p_{\max}$. Results aggregated over 6,000 episodes across LIBERO benchmarks using OpenVLA. Episodes with low average $p_{\max}$ exhibit significantly lower success rates, indicating that $p_{\max}$ serves as a reliable signal for the model's conviction and potential failure risk.
  • ...and 3 more figures
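
The attention temperature update described in the Figure 2 caption, $\Delta u_t = u_t - \bar{u}_{t-1}$ followed by $\gamma_{t+1} = \kappa^{\tanh(\Delta u_t)}$, can be sketched as below. This is a hedged illustration rather than the paper's code. The EMA decay `beta`, the default `kappa = 1.5`, and the choice to apply $\gamma$ by dividing the pre-softmax attention scores are all assumptions made for this example. The sketch takes $\kappa > 1$ so that a positive deviation gives $\gamma_{t+1} > 1$, matching the caption.

```python
import numpy as np

def update_attention_temperature(u_t, ema_prev, kappa=1.5, beta=0.9):
    """Return (gamma_next, ema_next) for one control step.

    ASSUMPTIONS: kappa > 1, and a standard exponential moving average
    with decay beta; neither value is taken from the paper.
    """
    delta = u_t - ema_prev                      # deviation from recent history
    gamma_next = kappa ** np.tanh(delta)        # bounded in [1/kappa, kappa]
    ema_next = beta * ema_prev + (1.0 - beta) * u_t
    return float(gamma_next), float(ema_next)

def tempered_attention(scores, gamma):
    # ASSUMPTION: the temperature divides the pre-softmax attention scores,
    # so gamma > 1 flattens the weights and gamma < 1 sharpens them.
    z = scores / gamma
    z = z - z.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 0.0, -1.0])
gamma_explore, ema = update_attention_temperature(u_t=2.0, ema_prev=0.5)
gamma_focus, _ = update_attention_temperature(u_t=-1.0, ema_prev=0.5)
print(gamma_explore, tempered_attention(scores, gamma_explore))
print(gamma_focus, tempered_attention(scores, gamma_focus))
```

Because $\tanh$ is bounded, the temperature stays within $[1/\kappa, \kappa]$ however large the deviation is, which keeps a single surprising step from flattening or collapsing attention entirely.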