HyperVLA: Efficient Inference in Vision-Language-Action Models via Hypernetworks

Zheng Xiong, Kang Li, Zilin Wang, Matthew Jackson, Jakob Foerster, Shimon Whiteson

TL;DR

This work tackles the high inference cost of Vision-Language-Action (VLA) models by introducing HyperVLA, an architecture that decouples inter-task knowledge from per-timestep control via a high-capacity hypernetwork (HN) that generates a task-specific base policy. At test time, the HN is invoked once per episode, after which only the compact base policy runs at every timestep, which sharply reduces activated parameters and latency while preserving generalization across tasks. The authors improve training stability and efficiency with a pretrained vision backbone (DINOv2), context-embedding normalization, and a simple linear action head trained with MSE, achieving performance on par with or better than monolithic VLAs such as OpenVLA at a much lower inference cost. Experiments on SIMPLER and LIBERO show strong zero-shot and few-shot performance together with large gains in inference efficiency, suggesting a practical path to deployable, language-conditioned robotic policies.

Abstract

Built upon language and vision foundation models with strong generalization ability and trained on large-scale robotic data, Vision-Language-Action (VLA) models have recently emerged as a promising approach to learning generalist robotic policies. However, a key drawback of existing VLAs is their extremely high inference cost. In this paper, we propose HyperVLA to address this problem. Unlike existing monolithic VLAs that activate the whole model during both training and inference, HyperVLA uses a novel hypernetwork (HN)-based architecture that activates only a small task-specific policy during inference, while still retaining the high model capacity needed to accommodate diverse multi-task behaviors during training. Successfully training an HN-based VLA is nontrivial, so HyperVLA contains several key algorithm design features that improve its performance, including properly utilizing the prior knowledge from existing vision foundation models, HN normalization, and an action generation strategy. Compared to monolithic VLAs, HyperVLA achieves a similar or even higher success rate for both zero-shot generalization and few-shot adaptation, while significantly reducing inference costs. Compared to OpenVLA, a state-of-the-art VLA model, HyperVLA reduces the number of activated parameters at test time by $90\times$ and accelerates inference speed by $120\times$. Code is publicly available at https://github.com/MasterXiong/HyperVLA

Paper Structure

This paper contains 39 sections, 1 equation, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Comparison between the high-level framework of a monolithic VLA (left) and an HN-based VLA (right). We use orange to represent parameters activated during training, and blue to represent parameters activated at every timestep during inference. The monolithic VLA activates the whole model during both training and inference and is thus colored both orange and blue. By contrast, at test time an HN-based VLA calls the HN at a low frequency, only at the beginning of each new episode, and calls a compact base network at every timestep for action prediction.
  • Figure 2: The framework of HyperVLA. The trainable parameters are marked as green blocks, while the HN-generated parameters are marked as light grey blocks with dashed edges.
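
The calling pattern described in Figure 1 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class `ToyHyperNetwork`, the two-layer MLP, the L2 normalization of the context, the random weights, and all dimensions are assumptions chosen for illustration. Only the overall structure comes from the text above: the HN is called once per episode to generate the parameters of a compact policy, and that policy (here a single linear action head) is called at every timestep.

```python
import numpy as np


class ToyHyperNetwork:
    """Hypothetical stand-in for the HN: maps a task-context embedding
    to the weights and bias of a small linear base policy."""

    def __init__(self, ctx_dim, obs_dim, act_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.obs_dim, self.act_dim = obs_dim, act_dim
        n_out = obs_dim * act_dim + act_dim  # flattened weights + bias
        self.w1 = rng.normal(0.0, 0.1, (ctx_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, n_out))
        self.calls = 0  # counts HN invocations, for illustration only

    def generate(self, ctx):
        """Generate base-policy parameters from a context embedding."""
        self.calls += 1
        # Assumed form of context normalization: plain L2 normalization.
        ctx = ctx / (np.linalg.norm(ctx) + 1e-8)
        flat = np.tanh(ctx @ self.w1) @ self.w2
        split = self.obs_dim * self.act_dim
        weights = flat[:split].reshape(self.obs_dim, self.act_dim)
        bias = flat[split:]
        return weights, bias


def base_policy(params, obs_features):
    """Compact per-timestep policy: a single linear action head."""
    weights, bias = params
    return obs_features @ weights + bias


def run_episode(hn, ctx, observations):
    """Call the HN once, then reuse the generated policy at every timestep."""
    params = hn.generate(ctx)  # low-frequency call: once per episode
    return [base_policy(params, obs) for obs in observations]
```

With these toy definitions, running a 10-step episode invokes `hn.generate` exactly once, while `base_policy` runs 10 times. That asymmetry is why the number of parameters activated per timestep can be far smaller than the total capacity available during training.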