HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models

Xin Yan; Zhenglin Wan; Feiyang Ye; Xingrui Yu; Hangyu Du; Yang You; Ivor Tsang

TL;DR

HBVLA binarizes VLA models with only marginal success-rate degradation compared to the full-precision model, demonstrating robust deployability under tight hardware constraints. It provides a practical foundation for ultra-low-bit quantization of VLAs, enabling more reliable deployment on hardware-limited robotic platforms.

Abstract

Vision-Language-Action (VLA) models enable instruction-following embodied control, but their large compute and memory footprints hinder deployment on resource-constrained robots and edge platforms. Reducing weights to 1-bit precision through binarization can greatly improve efficiency, but existing methods fail to narrow the distribution gap between binarized and full-precision weights, so quantization errors accumulate under long-horizon closed-loop execution and severely degrade actions. To fill this gap, we propose HBVLA, a VLA-tailored binarization framework. First, we use a policy-aware enhanced Hessian to identify weights that are truly critical for action generation. Then, we apply a sparse orthogonal transform to non-salient weights to induce a low-entropy intermediate state. Finally, we quantize both salient and non-salient weights in the Haar domain with group-wise 1-bit quantization. We evaluate our approach on different VLAs: on LIBERO, quantized OpenVLA-OFT retains 92.2% of full-precision performance; on SimplerEnv, quantized CogACT retains 93.6%, significantly outperforming state-of-the-art binarization methods. We further validate our method on a real-world evaluation suite, where HBVLA incurs only marginal success-rate degradation compared to the full-precision model, demonstrating robust deployability under tight hardware constraints. Our work provides a practical foundation for ultra-low-bit quantization of VLAs, enabling more reliable deployment on hardware-limited robotic platforms.
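
The final step of the abstract, group-wise 1-bit quantization in the Haar domain, can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration, not the paper's implementation: the one-level orthonormal Haar transform, the per-group mean-plus-scale sign binarizer, and the names `haar_forward`, `haar_inverse`, `binarize_groupwise`, and `quantize_haar_1bit` are all hypothetical, and the saliency partition and sparse orthogonal transform are omitted.

```python
import numpy as np

def haar_forward(x):
    """One-level orthonormal Haar transform along the last axis (even length assumed)."""
    x = np.asarray(x, dtype=np.float64)
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass: scaled pairwise sums
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: scaled pairwise differences
    return np.concatenate([approx, detail], axis=-1)

def haar_inverse(c):
    """Inverse of haar_forward."""
    c = np.asarray(c, dtype=np.float64)
    half = c.shape[-1] // 2
    approx, detail = c[..., :half], c[..., half:]
    x = np.empty_like(c)
    x[..., 0::2] = (approx + detail) / np.sqrt(2.0)
    x[..., 1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def binarize_groupwise(c, group_size):
    """Approximate each group as mu + alpha * sign(c - mu).

    Assumes a 2-D array whose column count is divisible by group_size.
    """
    rows, cols = c.shape
    g = c.reshape(rows, cols // group_size, group_size)
    mu = g.mean(axis=-1, keepdims=True)                     # per-group offset
    centered = g - mu
    alpha = np.abs(centered).mean(axis=-1, keepdims=True)   # per-group scale
    signs = np.where(centered >= 0, 1.0, -1.0)              # the stored 1-bit codes
    return (mu + alpha * signs).reshape(rows, cols)

def quantize_haar_1bit(w, group_size=4):
    """Transform, binarize per group, and map back to the weight domain."""
    return haar_inverse(binarize_groupwise(haar_forward(w), group_size))
```

In this sketch only the signs cost one bit per weight; the per-group `mu` and `alpha` are kept in higher precision, so a larger `group_size` lowers the storage overhead at the price of a coarser fit.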

Paper Structure (34 sections, 1 theorem, 50 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Vision-Language-Action Models
Network Binarization
Methodology
Method Overview
Policy-Aware Weight Partitioning
Saliency-Aware Hybrid Binarization With Haar Transform
Experiment
Experimental Setup
Main Results
Further Analysis
Conclusion
Derivations for Hessian Rectification
Proof of the Importance-Aware Weight Update (Eq. \ref{eq:vlmq_update})
...and 19 more sections

Key Result

Theorem 1

Let $\mathcal{L}_{\text{Block}}(\boldsymbol{\theta})$ be the block-wise loss and let $\Delta\boldsymbol{\theta}$ be the quantization-induced perturbation. Then the loss perturbation admits a first-order approximation in terms of the induced output error, where $\mathbf{z}$ denotes the block output and $\Delta\mathbf{z}$ the induced output error.
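
The displayed approximation is not reproduced in this listing. As a hedged sketch only, a standard first-order expansion consistent with the symbols in the statement would take the following form. This is an assumption based on the chain rule, not the paper's exact equation:

```latex
\mathcal{L}_{\text{Block}}(\boldsymbol{\theta} + \Delta\boldsymbol{\theta})
  - \mathcal{L}_{\text{Block}}(\boldsymbol{\theta})
  \approx
  \left( \frac{\partial \mathcal{L}_{\text{Block}}}{\partial \mathbf{z}} \right)^{\top}
  \Delta\mathbf{z}
```

Under this reading, the gradient of the block loss with respect to the block output weights each component of the output error by how much it changes the loss.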

Figures (4)

Figure 1: Left: The original observation highlighting a background artifact with an extreme activation magnitude (Val=106.5). Middle: The raw activation heatmap reveals an optimization landscape disproportionately dominated by these statistical outliers. Right: The overlay confirms the misalignment: the model's physical sensitivity is hijacked by distractors (e.g., the water bottle and background clutter) rather than the task-critical target (the apple), visually evidencing the dual dominance problem.
Figure 2: The pipeline of our HBVLA framework consists of two steps: (i) In Step 1, we establish a block-wise gradient probe to derive token importance scores ($S_t$) and construct a Corrected Hessian proxy to identify functionally salient weights grounded in the policy. (ii) In Step 2, we apply a hybrid quantization strategy: salient weights undergo high-fidelity residual quantization, while non-salient weights are processed via sparse orthogonal transform ($P$) and Haar wavelet transform ($\mathcal{H}$) prior to group-wise 1-bit quantization.
Figure 3: Comparison on the Mobile ALOHA experiments. Evaluation across three real-world tasks: (a) Pick and Place, (b) Sequenced Instruction, (c) Flexible Folding. Top: An intermediate-state image for each task. Bottom: Task-specific success rates for OpenVLA-OFT (FP Model), our HBVLA method, and the baselines BiLLM and HBLLM.
Figure 4: Compared with the FP (full-precision) baseline, which achieves a 74.8% task success rate, quantizing different components of CogACT in SimplerEnv leads to varying degrees of performance degradation.
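
The two-step pipeline in the Figure 2 caption can be sketched in NumPy under strong simplifying assumptions. The token-weighted Hessian diagonal below stands in for the paper's Corrected Hessian proxy, the column score `sum(w^2) * diag(H)` is an assumed saliency measure, and salient columns receive a two-term residual binarization. The sparse orthogonal transform ($P$) and Haar transform ($\mathcal{H}$) for non-salient weights are omitted, and every function name is hypothetical.

```python
import numpy as np

def weighted_hessian_diag(x, s):
    """Diagonal of sum_t s_t * x_t x_t^T for activations x (tokens, in) and scores s (tokens,)."""
    return (s[:, None] * x**2).sum(axis=0)

def select_salient_columns(w, h_diag, k):
    """Indices of the k input channels with the largest assumed sensitivity score."""
    score = (w**2).sum(axis=0) * h_diag
    return np.argsort(score)[-k:]

def binarize(w):
    """Row-wise sign binarization with the mean absolute value as scale."""
    alpha = np.abs(w).mean(axis=1, keepdims=True)
    return alpha * np.where(w >= 0, 1.0, -1.0)

def residual_binarize(w):
    """Two-term approximation: binarize w, then binarize the remaining residual."""
    first = binarize(w)
    return first + binarize(w - first)

def hybrid_binarize(w, x, s, k):
    """Residual binarization on salient columns, plain binarization elsewhere.

    Assumes 0 < k < number of input channels.
    """
    idx = select_salient_columns(w, weighted_hessian_diag(x, s), k)
    mask = np.zeros(w.shape[1], dtype=bool)
    mask[idx] = True
    out = np.empty_like(w, dtype=np.float64)
    out[:, mask] = residual_binarize(w[:, mask])
    out[:, ~mask] = binarize(w[:, ~mask])
    return out
```

The residual term roughly doubles the bit cost of the salient columns, which is why such schemes keep `k` small relative to the layer width.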

Theorems & Definitions (2)

Theorem 1: Connection between block-wise loss perturbation and output error
proof
