Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models

Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci

TL;DR

This work turns the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. It also finds that visual embeddings alone are the stronger signal for inferring task complexity, which the authors attribute to the semantic invariance of language.

Abstract

Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.

Paper Structure (17 sections, 10 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of our framework. Given an image observation and a language instruction of a task, the system extracts text and vision embeddings from the SmolVLA backbone. These embeddings are scored using an ensemble of GMMs and kNN. The resulting scores are consolidated into a unified score vector and processed by a custom MLP to select the optimal execution strategy. High confidence triggers direct execution (Act), while ambiguous scenarios result in additional reasoning (Think). Tasks that are entirely out-of-distribution invoke a safety fallback and prevent execution (Abstain).
  • Figure 2: Impact of GMM components $k$ on OOD detection. The F1-score peaks at $k=3$, demonstrating that while multiple Gaussians are necessary to capture the task manifold's complexity, further increases lead to overfitting and diminishing returns. Variance remains stable across the tested range.
  • Figure 3: Data Scaling and Component Ablation. Macro F1-Score performance across increasing fractions of training data for different model architectures under ID (LIBERO), partially OOD (LIBERO-PRO), and fully OOD tasks.
  • Figure 4: Confusion matrices across configurations. The GMM (vision-only) achieves the best class separation with zero leakage from fully OOD tasks into the "Act" regime. Conversely, the Baseline MLP and multimodal variants show high "Act" bias when evaluating partially OOD tasks.
  • Figure 5: Rollout examples in simulation. Representative episodes from the LIBERO and LIBERO-PRO benchmarks illustrating our adaptive framework. Known tasks are executed with high confidence (left), ambiguous scenarios trigger additional reasoning that eventually leads to success (center), and out-of-distribution tasks are preemptively halted (right).
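
The scoring-and-routing flow described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation and rests on several assumptions: toy 2-D vectors stand in for SmolVLA embeddings, one diagonal Gaussian per known cluster with equal weights stands in for a fitted GMM, and fixed thresholds replace the paper's MLP router. All function names and threshold values are hypothetical.

```python
import math

def fit_diag_gaussian(points):
    """Mean and per-dimension variance of a list of equal-length vectors."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    var = [sum((p[j] - mean[j]) ** 2 for p in points) / n + 1e-6
           for j in range(dim)]
    return mean, var

def gaussian_log_pdf(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def mixture_nll(x, components):
    """Parametric score: negative log-likelihood under an equal-weight mixture."""
    logs = [gaussian_log_pdf(x, m, v) - math.log(len(components))
            for m, v in components]
    top = max(logs)  # log-sum-exp trick for numerical stability
    return -(top + math.log(sum(math.exp(l - top) for l in logs)))

def knn_distance(x, bank, k=3):
    """Non-parametric score: mean distance to the k nearest stored embeddings."""
    return sum(sorted(math.dist(x, p) for p in bank)[:k]) / k

def route(scores, act=(5.0, 0.5), abstain=(200.0, 5.0)):
    """Threshold router (an assumed stand-in for the paper's MLP)."""
    nll, knn = scores
    if nll > abstain[0] or knn > abstain[1]:
        return "Abstain"
    if nll <= act[0] and knn <= act[1]:
        return "Act"
    return "Think"

# Toy "in-distribution" embeddings for two known tasks (made-up values).
task_a = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.2), (0.1, 0.1), (-0.2, -0.2)]
task_b = [(5.0, 5.0), (5.2, 4.9), (4.9, 5.2), (5.1, 5.1), (4.8, 4.8)]
components = [fit_diag_gaussian(task_a), fit_diag_gaussian(task_b)]
bank = task_a + task_b

def decide(embedding):
    """Build the score vector for one embedding and route it."""
    scores = (mixture_nll(embedding, components), knn_distance(embedding, bank))
    return route(scores)

for query in [(0.05, 0.0), (1.0, 1.0), (20.0, -20.0)]:
    print(query, "->", decide(query))
```

In this toy setup, a query near a known cluster is routed to Act, one moderately far from both clusters to Think, and a distant one to Abstain. A learned router over the same score vector, as the caption describes, would replace the hand-set thresholds in `route`.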