Whitening Not Recommended for Classification Tasks in LLMs

Ali Forooghi; Shaghayegh Sadeghi; Jianguo Lu

TL;DR

The paper systematically evaluates whitening as a post-processing step for sentence embeddings from large language models, covering PCA, ZCA, PCA-Cor, ZCA-Cor, and Cholesky whitening. Embeddings are centered as $X-\mu$, their covariance $\Sigma = (X-\mu)(X-\mu)^T$ is computed (up to normalization by the sample count), and a whitening matrix $W$ is applied so that the transformed covariance $W \Sigma W^T$ is the identity $I$. With the eigendecomposition $\Sigma = U \Lambda U^T$, PCA whitening uses $W = \Lambda^{-1/2} U^T$, while ZCA whitening rotates back to the original axes with $W = U \Lambda^{-1/2} U^T$. Across eight models and seven datasets, whitening consistently degrades classification performance, while its effect on STS tasks is model-dependent: some LLMs show mild gains, others do not. The IsoScore analysis shows that whitening increases isotropy, yet higher isotropy does not guarantee better downstream performance, with notable improvements in some vanilla models and limited or negative effects in fine-tuned ones such as SBERT/SimCSE and ChatGPT. As a by-product, the work introduces SentEval+, an embedding evaluation platform that enables affordable experimentation on commodity hardware and helps practitioners and researchers assess embedding quality across models and tasks.
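The centering, covariance, and whitening steps described in the TL;DR can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `whiten`, the `eps` stabilizer, and the row-per-sentence layout (so the covariance is computed as $X_c^T X_c/(n-1)$ rather than $(X-\mu)(X-\mu)^T$) are assumptions made here.

```python
import numpy as np

def whiten(X, method="pca", eps=1e-12):
    """Whiten embeddings X of shape (n_samples, dim); returns (Z, W).

    Hypothetical helper: rows are sentences, columns are embedding dimensions.
    """
    mu = X.mean(axis=0, keepdims=True)
    Xc = X - mu                                  # center the embeddings
    sigma = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance, (dim, dim)
    lam, U = np.linalg.eigh(sigma)               # sigma = U diag(lam) U^T
    inv_sqrt = 1.0 / np.sqrt(lam + eps)          # eps guards near-zero eigenvalues
    if method == "pca":
        W = inv_sqrt[:, None] * U.T              # Lambda^{-1/2} U^T
    elif method == "zca":
        W = (U * inv_sqrt) @ U.T                 # U Lambda^{-1/2} U^T
    else:
        raise ValueError(f"unknown method: {method}")
    # Each row x is mapped to W (x - mu); for row-major data that is Xc @ W.T.
    return Xc @ W.T, W

# Toy data standing in for sentence embeddings (correlated, non-zero mean).
rng = np.random.default_rng(0)
mix = rng.normal(size=(8, 8)) + 3.0 * np.eye(8)
X_demo = rng.normal(size=(500, 8)) @ mix + 3.0
Z_demo, _ = whiten(X_demo, "zca")
# Largest deviation of the whitened covariance from the identity.
print(np.abs(np.cov(Z_demo, rowvar=False) - np.eye(8)).max())
```

Both variants yield identity covariance; they differ only by a rotation, which is why ZCA output stays closest to the original coordinates while PCA output is aligned with the principal axes.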

Abstract

Sentence embedding is a cornerstone in NLP. Whitening has been claimed to be an effective operation for improving the quality of embeddings obtained from Large Language Models (LLMs). However, we find that the efficacy of whitening is both model-dependent and task-dependent. In particular, whitening degrades embeddings for classification tasks. This conclusion is supported by extensive experiments. We also explore a variety of whitening operations, including PCA, ZCA, PCA-Cor, ZCA-Cor, and Cholesky whitening. A by-product of our research is an embedding evaluation platform for LLMs called SentEval+.
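To make the less familiar variants named in the abstract concrete, the sketch below builds the ZCA-Cor, PCA-Cor, and Cholesky whitening matrices from a covariance matrix. It assumes the standard textbook definitions of these transforms (correlation-based variants standardize each dimension first; Cholesky whitening uses the Cholesky factor of the inverse covariance); the paper's own implementation may differ in detail. The function names are hypothetical, and the code assumes a full-rank covariance, which in practice requires more sentences than embedding dimensions or some regularization.

```python
import numpy as np

def _inv_sqrt_sym(A):
    """Inverse square root of a symmetric positive-definite matrix."""
    lam, U = np.linalg.eigh(A)
    return (U / np.sqrt(lam)) @ U.T              # U Lambda^{-1/2} U^T

def whitening_matrix(sigma, kind):
    """Return W with W @ sigma @ W.T == I for the chosen variant (sketch)."""
    d_inv = 1.0 / np.sqrt(np.diag(sigma))        # V^{-1/2}: inverse std per dimension
    P = sigma * np.outer(d_inv, d_inv)           # correlation matrix
    if kind == "zca-cor":
        return _inv_sqrt_sym(P) * d_inv[None, :]             # P^{-1/2} V^{-1/2}
    if kind == "pca-cor":
        theta, G = np.linalg.eigh(P)                         # P = G Theta G^T
        return (G.T / np.sqrt(theta)[:, None]) * d_inv[None, :]  # Theta^{-1/2} G^T V^{-1/2}
    if kind == "cholesky":
        L = np.linalg.cholesky(np.linalg.inv(sigma))         # sigma^{-1} = L L^T
        return L.T
    raise ValueError(f"unknown kind: {kind}")

# Toy check on correlated data: every variant should map sigma to the identity.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5)) @ (rng.normal(size=(5, 5)) + 3.0 * np.eye(5))
sigma_demo = np.cov(X_demo, rowvar=False)
for kind in ("zca-cor", "pca-cor", "cholesky"):
    W = whitening_matrix(sigma_demo, kind)
    print(kind, np.abs(W @ sigma_demo @ W.T - np.eye(5)).max())
```

All of these satisfy the same identity-covariance constraint and differ only by an orthogonal rotation, which is consistent with the paper averaging results over the five whitenings.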

Paper Structure (7 sections, 1 equation, 3 figures, 1 table, 1 algorithm)

Introduction
Whitening Transformations
Experiments
Classification Task
STS Task
Impact of Whitening on Isotropy
Conclusion

Figures (3)

Figure 1: Whitening leads to a deterioration in classification tasks (subplot A), but yields improvements in STS tasks on some models (subplot B). Performance is averaged over the five whitening methods, with the shaded area indicating the range.
Figure 2: Visualization of embeddings before and after whitening. Dimensions are reduced using PCA.
Figure 3: Improvement in isotropy, measured with IsoScore, due to whitening on the MR dataset.
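
Since Figure 3 reports isotropy with IsoScore, a rough sketch of how such a score can be computed may help in reading it. The code below follows the commonly described IsoScore recipe (variances along the principal axes, normalized and compared with the all-ones vector, then rescaled to the range 0 to 1). The function name and the exact normalization constants are written from that description and should be treated as assumptions, to be checked against the official IsoScore implementation before any serious use.

```python
import numpy as np

def isoscore(X):
    """Approximate IsoScore for points X of shape (n_samples, dim), dim >= 2."""
    n = X.shape[1]
    # After PCA reorientation the covariance is diagonal, and its diagonal
    # equals the eigenvalues of the original covariance matrix.
    sigma = np.cov(X, rowvar=False)
    var = np.clip(np.linalg.eigvalsh(sigma), 0.0, None)
    # Rescale the variance vector to have the same norm as the all-ones vector.
    var_hat = np.sqrt(n) * var / np.linalg.norm(var)
    # Isotropy defect: distance from the perfectly isotropic variance profile.
    delta = np.linalg.norm(var_hat - 1.0) / np.sqrt(2.0 * (n - np.sqrt(n)))
    # Fraction of dimensions that are uniformly utilized.
    k = (n - delta**2 * (n - np.sqrt(n))) ** 2 / n**2
    # Linear rescaling so the score lies in [0, 1].
    return (n * k - 1.0) / (n - 1.0)

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(2000, 10))
# One dominant direction, mimicking a highly anisotropic embedding space.
anisotropic = isotropic * np.array([10.0] + [0.01] * 9)
print(isoscore(isotropic), isoscore(anisotropic))
```

Whitened embeddings have identity covariance by construction, so they score close to 1 under this measure regardless of how useful they are downstream, which matches the paper's point that higher isotropy does not guarantee better task performance.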
