Stable Anisotropic Regularization

William Rudman; Carsten Eickhoff

TL;DR

This work challenges the NLP consensus that isotropy in contextualized embeddings is beneficial. It introduces IsoScore$^{\star}$, a differentiable measure of isotropy that is stable on mini-batches, and I-STAR, an IsoScore$^{\star}$-based regularizer. I-STAR controls isotropy through the loss $L_{\text{I-STAR}}=L_{CE}+\lambda(1-\text{IsoScore}^{\star}(\tilde{X},\zeta,\Sigma_{S}))$, using the shrinkage covariance $\Sigma_{\zeta}=\zeta\Sigma_{X}+(1-\zeta)\Sigma_{S}$ to stabilize covariance estimates. Across ALBERT, BERT, and DistilBERT on nine NLP tasks, the authors find that decreasing isotropy ($\lambda<0$) generally improves performance, that isotropy reductions correlate with a lower intrinsic dimensionality of activations, and that isotropy increases correlate with worse performance. These results, supported by reproducibility resources, suggest a need to reassess prior NLP claims about isotropy and highlight the value of differentiable, stable isotropy measures for regularization in deep learning.
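The loss and shrinkage formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and `isotropy_score` is an assumed IsoScore-style score computed from normalized covariance eigenvalues, since this page does not spell out the IsoScore$^{\star}$ definition. NumPy is not differentiable, so a training version would need an autodiff framework.

```python
import numpy as np


def shrinkage_covariance(x, sigma_s, zeta):
    """Blend the mini-batch covariance with a stored estimate (TL;DR formula)."""
    sigma_x = np.cov(x, rowvar=False)  # x has shape (tokens, dims)
    return zeta * sigma_x + (1.0 - zeta) * sigma_s


def isotropy_score(sigma):
    """Assumed IsoScore-style score: 1 for equal eigenvalues, 0 for rank one."""
    n = sigma.shape[0]
    eig = np.clip(np.linalg.eigvalsh(sigma), 0.0, None)
    eig_hat = np.sqrt(n) * eig / np.linalg.norm(eig)  # rescale to norm sqrt(n)
    defect = np.linalg.norm(eig_hat - 1.0) / np.sqrt(2.0 * (n - np.sqrt(n)))
    return ((n - defect**2 * (n - np.sqrt(n))) ** 2 - n) / (n * (n - 1))


def istar_loss(ce_loss, x, sigma_s, zeta, lam):
    """L_CE + lambda * (1 - score); lam < 0 rewards lower isotropy."""
    score = isotropy_score(shrinkage_covariance(x, sigma_s, zeta))
    return ce_loss + lam * (1.0 - score)


# Hypothetical usage with synthetic data standing in for token embeddings.
rng = np.random.default_rng(0)
dims = 8
reference = rng.normal(size=(5000, dims))   # stand-in for the larger sample S
sigma_s = np.cov(reference, rowvar=False)
scales = np.array([10.0] + [1.0] * (dims - 1))
batch = rng.normal(size=(64, dims)) * scales  # one dominant dimension
loss = istar_loss(ce_loss=0.7, x=batch, sigma_s=sigma_s, zeta=0.2, lam=-1.0)
```

With $\lambda<0$ the penalty term is negative whenever the score is below 1, so minimizing the loss pushes the score down; with $\lambda>0$ it pushes the score toward 1.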

Abstract

Given the success of Large Language Models (LLMs), there has been considerable interest in studying the properties of model activations. The literature overwhelmingly agrees that LLM representations are dominated by a few "outlier dimensions" with exceedingly high variance and magnitude. Several studies in Natural Language Processing (NLP) have sought to mitigate the impact of such outlier dimensions and force LLMs to be isotropic (i.e., have uniform variance across all dimensions in embedding space). Isotropy is thought to be a desirable property for LLMs that improves model performance and more closely aligns textual representations with human intuition. However, many of the claims regarding isotropy in NLP have been based on the average cosine similarity of embeddings, which has recently been shown to be a flawed measure of isotropy. In this paper, we propose I-STAR: IsoScore*-based STable Anisotropic Regularization, a novel regularization method that can be used to increase or decrease levels of isotropy in embedding space during training. I-STAR uses IsoScore*, the first accurate measure of isotropy that is both differentiable and stable on mini-batch computations. In contrast to several previous works, we find that decreasing isotropy in contextualized embeddings improves performance on the majority of tasks and models considered in this paper.
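One way to see why average cosine similarity can mislead as an isotropy measure is that it depends on the mean of the data, while variance does not. The sketch below is an illustration under that assumption, not an experiment from the paper: it builds two synthetic point clouds with the same covariance and compares their average pairwise cosine similarity. The helper name `average_cosine_similarity` is hypothetical.

```python
import numpy as np


def average_cosine_similarity(x):
    """Mean cosine similarity over all distinct pairs of rows in x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = x.shape[0]
    # Subtract the diagonal (self-similarity) before averaging.
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


rng = np.random.default_rng(0)
centered = rng.normal(size=(500, 16))  # equal variance in every dimension
shifted = centered + 5.0               # same covariance, non-zero mean

cos_centered = average_cosine_similarity(centered)
cos_shifted = average_cosine_similarity(shifted)
```

Both clouds have uniform variance across dimensions, yet the shifted one has a much higher average cosine similarity, so the statistic reflects the offset from the origin rather than how variance is spread.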

Paper Structure (17 sections, 2 equations, 8 figures, 2 tables, 2 algorithms)

Introduction
Related work
Measuring isotropy
IsoScore$^{\star}$
Methods
Experimental design
Results
Discussion
Conclusion
Reproducibility
IsoScore vs IsoScore$^{\star}$
Why Do Models Prefer Anisotropy?
Layer-wise isotropy
Impact of the Shrinkage Parameter on I-STAR
Applying I-STAR to Different Layers
...and 2 more sections

Figures (8)

Figure 1: Forward pass of our I-STAR loss function. Let $x_{l}$ be the token embeddings in a mini-batch at layer $l \in \{1,2,\ldots,n\}$, let $\Tilde{X} = \bigcup_{l=1}^{n}x_{l}$, let $\Sigma_{S_{i}}$ be the shrinkage covariance matrix for epoch $i$, and let $\zeta \in (0,1)$ be the shrinkage parameter. The I-STAR loss is a weighted sum of the cross-entropy loss, $L_{CE}$, and $\text{IsoScore}^{\star}(\Tilde{X}, \zeta, \Sigma_{S_{i}})$, where $\lambda$ is the tuning parameter. Negative values of $\lambda$ decrease isotropy in representations, and positive values of $\lambda$ encourage isotropy.
Figure 2: IsoScore$^{\star}(X,\zeta,\Sigma_{S})$ values for different choices of $\zeta$. The dashed line indicates the correct IsoScore$^{\star}$ value of $\bm{\bar{X}}$, which is IsoScore$^{\star}(\bm{\bar{X}})=0.86$. We calculate $\Sigma_{S}$ from a subsample $S \subset \bm{\bar{X}}$ such that $X \cap S = \emptyset$ and $|S|=75,000$.
Figure 3: Relationship between IsoScore$^{\star}$ (x-axis) and model performance (y-axis). We fine-tune each model with I-STAR using the tuning parameters $\lambda \in \{-5, -3, -1, 0.50, 1, 3, 5\}$. We train each model over five random seeds and report the standard deviation of both performance and IsoScore$^{\star}(X, \zeta, \Sigma_{S})$ values. We set $\zeta=0.2$ for all computations of IsoScore$^{\star}$, and we compute $\Sigma_{S}$ from a random sample of 250,000 token embeddings from the training data.
Figure 4: Comparing the mean activation values on the validation data for each dimension of ALBERT, BERT, and DistilBERT fine-tuned on QNLI, with CosReg using tuning-parameter values $\lambda \in \{-1, 1\}$, and without any regularization. Trends are representative of all tasks.
Figure 5: TwoNN Intrinsic Dimensionality estimate of ALBERT, BERT, and DistilBERT sentence embeddings, i.e. [CLS] tokens, obtained from the SST-2 validation data for models fine-tuned on SST-2 using I-STAR with tuning parameters $\lambda \in \{-5, -3, 3, 5\}$. "Base" represents the case where no regularization is used. Trends are representative of all tasks.
...and 3 more figures
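
Figure 5 reports TwoNN intrinsic dimensionality estimates. For readers unfamiliar with the estimator, the sketch below shows the maximum-likelihood form of the TwoNN idea, which uses only the ratio of each point's second and first nearest-neighbor distances. It is a minimal illustration on synthetic data: the function name `twonn_dimension` is hypothetical, and the paper's exact estimator settings are not given on this page.

```python
import numpy as np


def twonn_dimension(x):
    """Maximum-likelihood TwoNN estimate: N / sum(log(r2 / r1))."""
    sq = np.sum(x**2, axis=1)
    # Squared pairwise distances via the Gram matrix, clipped at zero.
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0, None)
    np.fill_diagonal(d2, np.inf)  # ignore each point's distance to itself
    # After partitioning, columns 0 and 1 hold the two smallest distances.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    mu = nearest[:, 1] / nearest[:, 0]
    return len(x) / np.sum(np.log(mu))


# Synthetic check: a 2-D sheet linearly embedded in 6-D space.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(1000, 2))
embedded = latent @ rng.normal(size=(2, 6))
estimate = twonn_dimension(embedded)
```

The estimate tracks the dimension of the sheet rather than the 6-D ambient space, which is the sense in which a lower value in Figure 5 indicates that activations occupy a lower-dimensional region.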
