Activation Quantization of Vision Encoders Needs Prefixing Registers
Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee
TL;DR
The paper addresses the challenge of activation quantization in large vision encoders by mitigating mid-layer activation outliers. It introduces RegCache, a training-free method that curates universal middle-layer registers, caches their KV representations, and deletes outlier tokens to compress the activation range, which improves post-training quantization (PTQ) across diverse ViTs. Experiments on CLIP, OpenCLIP, SigLIP, SigLIP2, and DINOv2 show that RegCache consistently improves zero-shot classification and image-text retrieval, especially at 4-bit and 6-bit quantization, with minimal computational overhead. The work shows that outliers in vision encoders are a middle-layer phenomenon with universal characteristics, and that prefixing appropriate registers can significantly improve PTQ outcomes without retraining.
Abstract
Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
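To illustrate why massive activations make quantization difficult, the following is a minimal sketch, not the RegCache algorithm itself. It assumes a symmetric per-tensor uniform quantizer, which is a common choice but not necessarily the setting used in the paper. The activation shapes, the outlier magnitude, and the helper names `fake_quantize` and `normal_token_error` are hypothetical and chosen only for illustration.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (num_bits - 1) - 1
    # A single scale for the whole tensor: the largest magnitude sets the step size.
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def normal_token_error(x, normal_mask, num_bits=8):
    """Mean squared quantization error on the non-outlier entries only."""
    x_hat = fake_quantize(x, num_bits)
    return float(np.mean((x[normal_mask] - x_hat[normal_mask]) ** 2))

rng = np.random.default_rng(0)

# Hypothetical activations: 196 tokens with 64 channels, roughly unit scale.
acts = rng.normal(0.0, 1.0, size=(196, 64))

# Same activations, but one entry becomes a massive outlier (assumed magnitude).
with_outlier = acts.copy()
with_outlier[0, 0] = 500.0

# Measure error only on the entries that are not the outlier.
mask = np.ones_like(acts, dtype=bool)
mask[0, 0] = False

err_clean = normal_token_error(acts, mask)
err_outlier = normal_token_error(with_outlier, mask)

print(f"8-bit error without outlier: {err_clean:.6f}")
print(f"8-bit error with outlier:    {err_outlier:.6f}")
```

In this sketch, the single outlier stretches the quantization step so far that most ordinary values round to zero, even at 8 bits. Confining outliers to a few semantically meaningless tokens and removing those tokens, as the abstract describes, keeps the quantization range of the remaining tokens narrow.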
