HyperVLA: Efficient Inference in Vision-Language-Action Models via Hypernetworks
Zheng Xiong, Kang Li, Zilin Wang, Matthew Jackson, Jakob Foerster, Shimon Whiteson
TL;DR
This work tackles the high inference cost of Vision-Language-Action (VLA) models by introducing HyperVLA, an architecture that decouples inter-task knowledge from per-timestep control: a high-capacity hypernetwork (HN) generates a compact, task-specific base policy. At test time, the HN is invoked once per episode and only the base policy runs at every timestep, which sharply reduces the number of activated parameters and the latency while preserving generalization across tasks. To make training stable and efficient, the authors use a DINOv2 vision backbone, context-embedding normalization, and a simple linear action head trained with an MSE loss. The resulting model performs on par with or better than monolithic VLAs such as OpenVLA while running inference substantially faster. Experiments on SIMPLER and LIBERO show strong zero-shot and few-shot performance alongside large gains in inference efficiency, suggesting a practical path to deployable, language-conditioned robotic policies.
Abstract
Built upon language and vision foundation models with strong generalization ability and trained on large-scale robotic data, Vision-Language-Action (VLA) models have recently emerged as a promising approach to learning generalist robotic policies. However, a key drawback of existing VLAs is their extremely high inference cost. In this paper, we propose HyperVLA to address this problem. Unlike existing monolithic VLAs that activate the whole model during both training and inference, HyperVLA uses a novel hypernetwork (HN)-based architecture that activates only a small task-specific policy during inference, while still retaining the high model capacity needed to accommodate diverse multi-task behaviors during training. Successfully training an HN-based VLA is nontrivial, so HyperVLA includes several key algorithm design features that improve its performance: properly utilizing the prior knowledge from existing vision foundation models, HN normalization, and an action generation strategy. Compared to monolithic VLAs, HyperVLA achieves a similar or even higher success rate for both zero-shot generalization and few-shot adaptation, while significantly reducing inference costs. Compared to OpenVLA, a state-of-the-art VLA model, HyperVLA reduces the number of activated parameters at test time by $90\times$ and speeds up inference by $120\times$. Code is publicly available at https://github.com/MasterXiong/HyperVLA.
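
To make the inference pattern described above concrete, the following is a minimal toy sketch and not the HyperVLA implementation. It assumes a purely linear hypernetwork and a linear base policy over small hand-written vectors; the names `ToyHypernetwork`, `base_policy`, and `run_episode` are hypothetical and chosen only for illustration. The point it shows is structural: the large model is called once per episode to emit the parameters of a small policy, and only that small policy is evaluated at each timestep.

```python
import random


class ToyHypernetwork:
    """Hypothetical linear hypernetwork: task context -> parameters of a linear policy."""

    def __init__(self, context_dim, obs_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.obs_dim = obs_dim
        self.action_dim = action_dim
        # The generated policy has a weight matrix (action_dim x obs_dim) and a bias.
        self.n_policy_params = action_dim * obs_dim + action_dim
        # One row of hypernetwork weights per generated policy parameter.
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(context_dim)]
            for _ in range(self.n_policy_params)
        ]
        self.calls = 0  # counts how often the hypernetwork is invoked

    def generate(self, context):
        """Map a task context vector to the (W, b) parameters of the base policy."""
        self.calls += 1
        flat = [sum(w * c for w, c in zip(row, context)) for row in self.weights]
        split = self.action_dim * self.obs_dim
        W = [
            flat[i * self.obs_dim:(i + 1) * self.obs_dim]
            for i in range(self.action_dim)
        ]
        b = flat[split:]
        return W, b


def base_policy(params, obs):
    """Small per-timestep policy: a single linear map from observation to action."""
    W, b = params
    return [sum(w * o for w, o in zip(row, obs)) + b_i for row, b_i in zip(W, b)]


def run_episode(hypernet, context, observations):
    """Invoke the hypernetwork once, then reuse the generated policy at every step."""
    params = hypernet.generate(context)  # single hypernetwork call per episode
    return [base_policy(params, obs) for obs in observations]


# Illustrative sizes and inputs only; they carry no meaning beyond the example.
hn = ToyHypernetwork(context_dim=4, obs_dim=3, action_dim=2)
context = [1.0, 0.0, 0.5, -0.5]
observations = [[0.1, 0.2, 0.3] for _ in range(5)]
actions = run_episode(hn, context, observations)
print(len(actions), hn.calls)
```

In this sketch the per-step cost depends only on the size of the generated policy, while the hypernetwork's cost is paid once per episode. A real system would replace the context vector with an encoding of the task and the linear maps with learned networks.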
