Is It a Free Lunch for Removing Outliers during Pretraining?
Baohao Liao, Christof Monz
TL;DR
This paper investigates whether removing outliers during pretraining makes large language models easier to quantize, and finds that naive outlier removal can hurt full-precision transfer performance. It extends clipped softmax (CS) to a normalized variant (NCS) whose normalization is invariant to sequence length, which addresses the sequence-length mismatch between pretraining and downstream tasks and enables outlier-free pretraining of causal LLMs. The study shows that CS is not universally beneficial, whereas NCS improves FP16 quantization stability for BERT-like models and enables outlier-free pretraining for causal LLMs such as OPT under certain conditions, with some scale-dependent limitations. The results inform when and how to apply outlier mitigation when deploying quantized LLMs.
Abstract
With the growing size of large language models, the role of quantization becomes increasingly significant. However, outliers present in weights or activations notably influence the performance of quantized models. Recently, Bondarenko et al. (2023) introduced a novel softmax function aimed at pretraining models in an outlier-free manner, thereby enhancing their suitability for quantization. Interestingly, we observed that such an approach leads to performance degradation in full precision. Building on this insight, we enhance the method by ensuring its normalization is invariant to sequence length, a crucial factor for bridging the gap between pretraining and fine-tuning. Moreover, this improved method also facilitates successful pretraining of causal language models.
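To make the sequence-length issue concrete, here is a minimal sketch of a clipped softmax in its commonly stated stretch-and-clip form, clip((ζ − γ) · softmax(x) + γ, 0, 1) with γ ≤ 0 and ζ ≥ 1. The function names, the default values of `gamma` and `zeta`, and the uniform-attention probe `zero_fraction` are illustrative assumptions and are not taken from this abstract. The sketch does not implement the length-invariant normalization proposed here, since its formula is not given in this section; it only shows why a fixed stretch behaves differently at different sequence lengths.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_softmax(scores, gamma=-0.03, zeta=1.0):
    """Stretch softmax outputs from (0, 1) to (gamma, zeta), then clip to [0, 1].

    gamma and zeta are illustrative constants (assumption), not values
    reported in the source.
    """
    assert gamma <= 0.0 and zeta >= 1.0
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma)) for p in softmax(scores)]


def zero_fraction(length, gamma):
    """Fraction of exactly-zero outputs for uniform scores over `length` tokens.

    Hypothetical helper: uniform attention assigns 1/length to every token,
    so it isolates the effect of sequence length on clipping.
    """
    probs = clipped_softmax([0.0] * length, gamma=gamma)
    return sum(p == 0.0 for p in probs) / length


# Small probabilities are clipped to exactly zero, so a head can ignore
# tokens without pushing its logits to extreme values.
print(clipped_softmax([10.0, 0.0, 0.0, 0.0]))

# With the same fixed gamma, a short sequence keeps every token while a
# long sequence has every token clipped to zero.
print(zero_fraction(16, gamma=-0.03), zero_fraction(128, gamma=-0.03))
```

Because a uniform softmax assigns 1/T to each of T tokens, a stretch tuned at one sequence length clips very differently at another. This is the kind of pretraining versus fine-tuning mismatch that a length-invariant normalization is meant to remove.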
