QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Jingxuan Zhang; Yunta Hsieh; Zhongwei Wan; Haokun Lin; Xin Wang; Ziqi Wang; Yingtie Lei; Mi Zhang

TL;DR

QuantVLA is a training-free post-training quantization (PTQ) framework for vision-language-action (VLA) models. To the authors' knowledge, it is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head, offering a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.

Abstract

Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in the language backbone and all MLP layers in the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
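The selective quantization layout can be illustrated with a minimal sketch. It follows the split given in the Figure 2 caption (all LLM linear layers and the DiT MLP layers become integer, the DiT attention projections stay in floating point). The layer names, the `keep_fp` rule, the 4-bit default, and the symmetric per-output-channel quantizer are assumptions made for illustration only; they are not taken from the paper, which builds on DuQuant rather than this plain rounding scheme.

```python
import numpy as np

# Hypothetical name fragments for the attention projections Q, K, V, O.
ATTN_TAGS = ("q_proj", "k_proj", "v_proj", "o_proj")


def keep_fp(name):
    """Assumed rule: only DiT attention projections stay in floating point."""
    return name.startswith("dit.") and any(tag in name for tag in ATTN_TAGS)


def quantize_weight(w, bits=4):
    """Symmetric per-output-channel quantization (plain rounding, not DuQuant)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return codes, scale


def apply_layout(layers, bits=4):
    """Map each layer name to ("fp", w) or ("int", codes, scale)."""
    out = {}
    for name, w in layers.items():
        if keep_fp(name):
            out[name] = ("fp", w)
        else:
            codes, scale = quantize_weight(w, bits)
            out[name] = ("int", codes, scale)
    return out
```

A weight is recovered approximately as `codes * scale`, so any additional per-head or per-layer scalar can in principle be multiplied into `scale` ahead of time rather than applied as a separate operator.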

Paper Structure

This paper contains 30 sections, 30 equations, 4 figures, 6 tables.

Introduction
Related Work
Vision-Language-Action Models
Efficient and Compact VLA Models
Efficiency Frameworks for Pretrained VLAs
Post-Training Quantization
Method
Preliminaries on Diffusion-based VLA Models
Post-Training Quantization Setup and Emergent DiT Sensitivity
DuQuant Reparameterization.
Challenges in Implementing Quantization for VLA
QuantVLA Framework
Experiment
Experimental Settings
Model and Benchmark.
...and 15 more sections

Figures (4)

Figure 1: Comparison of representative VLA efficiency frameworks. (1) TinyVLA focuses on compact multimodal transformers and lightweight diffusion-policy heads for architectural efficiency; (2) EfficientVLA accelerates inference by pruning redundant language layers and reusing intermediate representations; (3) VLA-Cache improves throughput through key-value reuse and static caching of vision tokens; (4) MoLe-VLA adopts mixture-of-layers routing to dynamically skip computation in the language module; and (5) QuantVLA introduces a training-free PTQ framework that low-bit quantizes both language and action modules without altering the model architecture.
Figure 2: Overview of QuantVLA for VLAs with a DiT-based action head. The framework is training-free and preserves the original architecture and operator schedule. It combines: (1) a selective quantization layout that integerizes all linear layers in the LLM and all MLP layers in the DiT while keeping the attention projections $Q$, $K$, $V$, $O$ in floating point; (2) Attention Temperature Matching (ATM), a per-head scalar $\alpha$ that aligns teacher–student logits and is folded into dequantization scales; and (3) Output Head Balancing (OHB), a per-layer scalar $\beta$ that matches post-projection energy at the residual interface.
Figure 3: Effects of ATM and OHB across attention blocks, evaluated on the GR00T N1.5 model. Left: standard deviation of the attention logits. Right: RMS of the attention output after the output projection. Each panel compares three configurations: the floating-point teacher without quantization, the quantized baseline with the LLM and DiT MLP integerized, and QuantVLA (with ATM in the left panel, with OHB in the right panel).
Figure 4: Memory saving of QuantVLA over the baseline on OpenPI $\pi_{0.5}$ and GR00T N1.5.
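
The two calibration scalars named in the Figure 2 caption can be sketched from the statistics plotted in Figure 3. The exact estimators are not given in this summary, so the following is an assumption: $\alpha$ is taken as the per-head ratio of teacher to student logit standard deviation, and $\beta$ as the per-layer ratio of teacher to student output RMS, both measured on a small calibration buffer. Function names and tensor shapes are hypothetical.

```python
import numpy as np


def atm_alpha(teacher_logits, student_logits, eps=1e-8):
    """Assumed ATM estimator: per-head ratio of logit standard deviations.

    Both inputs have shape (samples, heads, queries, keys).
    Scaling the student logits of head h by alpha[h] matches the teacher std.
    """
    t_std = teacher_logits.std(axis=(0, 2, 3))
    s_std = student_logits.std(axis=(0, 2, 3))
    return t_std / (s_std + eps)


def ohb_beta(teacher_out, student_out, eps=1e-8):
    """Assumed OHB estimator: per-layer ratio of output RMS values.

    Inputs are the attention outputs after the output projection.
    """
    t_rms = np.sqrt(np.mean(teacher_out ** 2))
    s_rms = np.sqrt(np.mean(student_out ** 2))
    return t_rms / (s_rms + eps)
```

Under these assumptions, a quantized head whose logits are uniformly 1.5 times too large gets $\alpha \approx 1/1.5$, and a layer whose output energy has dropped by half gets $\beta \approx 2$. Because both are plain scalars, they can be multiplied into existing dequantization scales, which is consistent with the claim that the operator schedule stays unchanged.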
