Is It a Free Lunch for Removing Outliers during Pretraining?

Baohao Liao; Christof Monz

TL;DR

This paper investigates whether removing outliers during pretraining makes large language models easier to quantize, and finds that naive outlier removal can hurt full-precision performance when the model is transferred to downstream tasks. It extends clipped softmax (CS) with a normalization that is invariant to sequence length (NCS), addressing the sequence-length mismatch between pretraining and downstream tasks. The study shows that CS is not universally beneficial, whereas NCS improves FP16 quantization stability for BERT-like models and, under certain conditions and with some scale-dependent limitations, enables outlier-free pretraining of causal LLMs such as OPT. The results inform when and how to apply outlier mitigation when deploying quantized LLMs.

Abstract

With the growing size of large language models, the role of quantization becomes increasingly significant. However, outliers present in weights or activations notably influence the performance of quantized models. Recently, \citet{qtransformer} introduced a novel softmax function aimed at pretraining models in an outlier-free manner, thereby enhancing their suitability for quantization. Interestingly, we observed that such an approach leads to performance degradation in full precision. Building on this insight, we enhance the method by ensuring its normalization is invariant to sequence length, a crucial factor for bridging the gap between pretraining and fine-tuning. Moreover, this improved method also facilitates successful pretraining of causal language models.
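To make the sequence-length issue concrete, here is a minimal sketch of a clipped softmax in plain Python. It assumes the commonly cited form clip((zeta - gamma) * softmax(x) + gamma, 0, 1); the helper names and the value gamma = -0.03 are illustrative assumptions, not taken from this page, and the block does not implement the length-invariant variant proposed in the paper, whose formula is not given here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_softmax(scores, gamma=-0.03, zeta=1.0):
    """Stretch softmax outputs to [gamma, zeta], then clip back to [0, 1].

    gamma < 0 lets small probabilities become exactly zero without
    requiring very large score differences. The default values are
    illustrative assumptions.
    """
    stretch = zeta - gamma
    return [min(1.0, max(0.0, stretch * p + gamma)) for p in softmax(scores)]


def zero_fraction(length, gamma=-0.03):
    """Fraction of exact zeros for uniform scores at a given sequence length."""
    probs = clipped_softmax([0.0] * length, gamma=gamma)
    return sum(p == 0.0 for p in probs) / length


if __name__ == "__main__":
    # Same gamma, different sequence lengths.
    for length in (16, 32, 64, 128):
        print(length, zero_fraction(length))
```

With uniform scores each softmax output is 1/T for sequence length T, so a fixed gamma clips nothing at short lengths and everything at long ones. This is one way to see why a clipping threshold tuned at the pretraining length may behave differently on downstream inputs of another length.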

Paper Structure (10 sections, 4 equations, 9 figures, 7 tables)

Introduction
Preliminaries
Removing Outliers Is not a Free Lunch
Results on BERT
Results on OPT
Conclusion
Experimental Setting
Pretraining
Evaluation
Quantization

Figures (9)

Figure 1: Common self-attention pattern of BERT-base \citep{bert}. Refer to Figures \ref{fig: heatmap 01}, \ref{fig: heatmap 23}, \ref{fig: heatmap 45}, \ref{fig: heatmap 67}, \ref{fig: heatmap 89}, and \ref{fig: heatmap 1011} for more layers.
Figure 2: FP16 performance of different methods. Left: Finetuning results of BERT-base on GLUE. Right: MLM accuracy of pretrained BERT on the validation sets with different sequence lengths. BERT-small and BERT-base are pretrained with a max sequence length of 256 and 128, respectively. Refer to Tables \ref{tab: glue} and \ref{tab: val length} for detailed numbers.
Figure 3: Pretraining BERT-small with various max sequence lengths. Refer to Table \ref{tab: pretrained length} for detailed numbers.
Figure 4: Self-attention pattern for BERT-base. Left: head #0. Right: head #1.
Figure 5: Self-attention pattern for BERT-base. Left: head #2. Right: head #3.
...and 4 more figures
