Activation Quantization of Vision Encoders Needs Prefixing Registers

Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee

TL;DR

The paper addresses the challenge of activation quantization in large vision encoders by mitigating mid-layer activation outliers. It introduces RegCache, a training-free method that curates universal middle-layer registers, caches their KV representations, and deletes outlier tokens to compress the activation range, improving post-training quantization (PTQ) across diverse ViTs. Extensive experiments on CLIP, OpenCLIP, SigLIP, SigLIP2, and DINOv2 show that RegCache consistently boosts zero-shot classification and image-text retrieval, especially at 4-bit and 6-bit quantization, with minimal computational overhead. The work reveals that outliers in vision encoders are a middle-layer phenomenon with universal characteristics, and that appropriate prefixing of registers can significantly improve PTQ outcomes without retraining.

Abstract

Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
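
To make the outlier problem concrete, the following minimal sketch shows how a single large activation inflates the quantization step for every other value in the same tensor. It assumes symmetric per-tensor quantization, which is only one common scheme and not necessarily the one used in the paper; the function names and the toy values are illustrative and do not come from the source.

```python
def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor fake quantization (an assumed scheme):
    round each value to a signed integer grid, then map it back."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax  # one step size shared by the whole tensor
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, reconstructed):
    """Average absolute reconstruction error over paired entries."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


# Toy activations spread evenly over [-1, 1] (illustrative, not from the paper).
normal = [-1.0 + 2.0 * k / 63 for k in range(64)]

# The same activations plus one massive entry standing in for an outlier.
with_outlier = normal + [200.0]

err_normal = mean_abs_error(normal, quantize_dequantize(normal, bits=8))

# Measure the error only on the non-outlier entries of the second tensor.
deq_outlier = quantize_dequantize(with_outlier, bits=8)
err_outlier = mean_abs_error(normal, deq_outlier[:-1])

print(f"mean abs error without outlier: {err_normal:.5f}")
print(f"mean abs error with outlier:    {err_outlier:.5f}")
```

Because the scale is set by the largest magnitude, the outlier stretches the grid so far that most ordinary values collapse onto only a few levels. Narrowing the dynamic range is what restores resolution for the remaining tokens.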

Paper Structure

This paper contains 28 sections, 1 theorem, 6 equations, 8 figures, 14 tables.

Key Result

Lemma 1

Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ and fix an index $i \in \{1,\dots,d\}$. Let $\mathbf{1}_i \in \mathbb{R}^d$ denote the one-hot vector whose $i$th entry is $1$ and all other entries are $0$. For $C \in \mathbb{R}^+$, define $\mathbf{x}' = \mathbf{x} + C \mathbf{1}_i$ and $\mathbf{y}' = \dots$ Moreover, for $\mathbf{y}'' = \mathbf{y} + C \mathbf{1}_j$, where $j \neq i$, the vectors become ...
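
The construction in the lemma can be explored numerically. The sketch below adds a spike of size $C$ to two vectors, either at the same coordinate or at different coordinates, and prints their cosine similarity. Cosine similarity as the quantity of interest, the reading that $\mathbf{y}'$ receives its spike at index $i$, and the example vectors are all assumptions made for illustration; the text above does not state them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def add_spike(vec, index, c):
    """Return a copy of vec with c added to one coordinate (vec + c * one_hot(index))."""
    out = list(vec)
    out[index] += c
    return out


# Hypothetical example vectors and indices (0-based), not taken from the paper.
x = [1.0, -2.0, 0.5, 3.0]
y = [-1.0, 0.5, 2.0, -0.5]
i, j = 0, 2

results = {}
for c in (0.0, 10.0, 1000.0):
    same = cosine(add_spike(x, i, c), add_spike(y, i, c))  # spikes share index i
    diff = cosine(add_spike(x, i, c), add_spike(y, j, c))  # spikes at i and j != i
    results[c] = (same, diff)
    print(f"C={c:>7}: same-index cos={same:+.4f}, different-index cos={diff:+.4f}")
```

As $C$ grows, the shared-index pair becomes nearly parallel while the different-index pair becomes nearly orthogonal, which follows directly from the spike dominating each vector's norm.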

Figures (8)

  • Figure 1: (Left) Sink tokens in LLMs vs. vision encoders. In LLMs, well-known sink tokens exist in a closed-set vocabulary. In contrast, vision encoders take image inputs composed of diverse patches that are continuously mapped into an embedding space, making the discovery of sink tokens more challenging. (Right) Activation magnitudes at the input of the 8th layer of CLIP-B/16, with and without RegCache (ImageNet-1k). RegCache discovers register tokens and inserts them into quantization-sensitive layers rather than at the input. This operation mitigates outliers, thereby narrowing the dynamic range and enabling more effective activation quantization at low bitwidths.
  • Figure 2: (Top) Layerwise quantization sensitivity (%). We plot the zero-shot ImageNet-1k accuracy of various vision encoders when we quantize only one layer to W8A8. (Bottom) Maximum norm of the FC2 layer input tokens for each layer. We plot the largest $\ell_\infty$-norm of all tokens in an image on a logarithmic scale, averaged over the ImageNet-1k validation set. The layers that are sensitive to quantization coincide with those where activation outliers appear. For both plots, the x-axis denotes the index of the transformer block.
  • Figure 3: Emergence of outliers in foreground- and background-only images on SigLIP-B/16. Note that the curves for "original" and "BG-only" almost overlap.
  • Figure 4: Overview of the proposed method. We identify a universal register by analyzing the inputs of quantization-sensitive layers across blocks. During inference, the register is inserted into each block, and outlier tokens are removed from the most quantization-sensitive layer.
  • Figure 5: (Top) Layerwise quantization sensitivity (%). Zero-shot ImageNet-1k accuracy when we quantize only one layer to W8A8. (Bottom) Layerwise max token norms. The largest $\ell_\infty$-norm of all tokens in an image on a logarithmic scale, averaged over the ImageNet-1k validation set.
  • ...and 3 more figures
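
The layerwise statistic described in the captions of Figures 2 and 5 (the largest $\ell_\infty$-norm over all tokens in an image, averaged over images) can be sketched as follows. This is a minimal illustration on hand-written toy activations; the nested-list layout `[layer][image][token][channel]` and the function names are assumptions, and a real measurement would instead record the FC2 inputs of an actual encoder.

```python
import math


def max_token_linf(tokens):
    """Largest l_inf norm over all tokens of one image.
    `tokens` is a list of per-token channel vectors."""
    return max(max(abs(c) for c in token) for token in tokens)


def layerwise_max_norm(per_layer_images):
    """For each layer, average the per-image maximum token norm over all images."""
    return [
        sum(max_token_linf(image) for image in images) / len(images)
        for images in per_layer_images
    ]


# Toy activations: 3 layers, 2 images, 3 tokens, 2 channels (illustrative only).
acts = [
    [[[0.5, -1.0], [0.2, 0.1], [0.3, -0.4]], [[1.0, 0.5], [-0.5, 0.2], [0.1, 0.1]]],
    [[[0.5, -1.0], [90.0, 0.1], [0.3, -0.4]], [[1.0, 0.5], [-110.0, 0.2], [0.1, 0.1]]],
    [[[2.0, -1.0], [0.2, 0.1], [0.3, -0.4]], [[1.0, 0.5], [-0.5, 2.0], [0.1, 0.1]]],
]

norms = layerwise_max_norm(acts)
for layer_index, norm in enumerate(norms):
    # The log10 value is what a logarithmic y-axis would display.
    print(f"layer {layer_index}: mean max norm = {norm:.1f}, log10 = {math.log10(norm):.2f}")
```

In this toy data the middle layer carries one massive token per image, so its statistic stands out by orders of magnitude, which is the kind of spike the bottom panels are meant to expose.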

Theorems & Definitions (2)

  • Lemma 1
  • proof