Zipfian Whitening

Sho Yokoi, Han Bao, Hiroto Kurita, Hidetoshi Shimodaira

TL;DR

The theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well precisely because their word embeddings encode the empirical word frequency into the underlying probabilistic model.

Abstract

The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective, and in terms of the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.
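The frequency-weighted PCA whitening described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' reference implementation: the function name `zipfian_whitening`, the arguments `W`, `p`, and `eps`, the synthetic data, and the choice of an eigendecomposition of the weighted covariance are all assumptions made here.

```python
import numpy as np

def zipfian_whitening(W, p, eps=1e-12):
    """Whiten embeddings W (n_words, d) under word probabilities p (n_words,).

    Hypothetical helper: the mean and covariance are expectations under p
    rather than uniform averages over the vocabulary.
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                          # normalize to a probability vector
    mu = p @ W                               # frequency-weighted mean, shape (d,)
    X = W - mu                               # center around the weighted mean
    cov = (X * p[:, None]).T @ X             # frequency-weighted covariance, (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov is symmetric, so eigh applies
    # Rotate onto the principal axes and rescale each axis to unit variance.
    return X @ eigvecs / np.sqrt(eigvals + eps)

# Synthetic stand-in for real embeddings and corpus frequencies (assumption).
rng = np.random.default_rng(0)
n_words, dim = 1000, 8
W = rng.normal(size=(n_words, dim)) @ rng.normal(size=(dim, dim)) + 3.0
p = 1.0 / np.arange(1, n_words + 1)          # Zipf-like: p(w) proportional to 1/rank
p /= p.sum()

Z = zipfian_whitening(W, p)
weighted_mean = p @ Z
weighted_cov = (Z * p[:, None]).T @ Z
```

After the transform, `weighted_mean` should be close to the zero vector and `weighted_cov` close to the identity matrix, both measured under `p`. Passing a constant `p` recovers ordinary (uniform) PCA whitening, which makes the two variants easy to compare.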

Paper Structure

This paper contains 39 sections, 2 theorems, 21 equations, 3 figures, 11 tables, 2 algorithms.

Key Result

Proposition 1

$\mathrm{Sym}_2(\boldsymbol v)$ takes values in $[0, 1]$, and $\mathrm{Sym}_2(\boldsymbol v) = 1$ if and only if $\boldsymbol v$ is isotropic around its barycenter (Definition 2). Proof. See sec:proof_symmetry_second.

Figures (3)

  • Figure 1: Low-frequency words and high-frequency words are unevenly distributed in the embedding space (mu2018allbutthetop; Gong2018-frage; Provilkov2020-bpe-dropout; bis-too-much-in-common). Consequently, the "apparent" mean calculated by unweighted averaging often differs from the actual centroid.
  • Figure 2: The relationship between the 1st-order symmetry (Definition 3, $x$-axis), the 2nd-order symmetry (Definition 4, $y$-axis), and task performance (color). Each point represents either pre-trained or post-processed word embeddings (GloVe, word2vec, and fastText). The Zipfian measure captures downstream task performance well (right), while the uniform isotropic measure does not (left).
  • Figure 3: Relationship between the information content $-\log{p(w)}$ and the vector norm $\lVert\boldsymbol w\rVert_2$ for the 500 most frequent words $w$. The center panel shows the pre-trained GloVe model. With Zipfian whitening, the information content becomes encoded in the norm (center to right). With uniform whitening, this does not occur (center to left).
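The gap between the unweighted mean and the frequency-weighted centroid noted in Figure 1 can be reproduced on toy data. The sketch below is illustrative only: the helper names `uniform_mean` and `zipfian_mean`, the two-cluster layout, and the $1/\text{rank}$ frequencies are assumptions made here, not data from the paper.

```python
import numpy as np

def uniform_mean(W):
    """Unweighted average over the vocabulary (the "apparent" mean)."""
    return W.mean(axis=0)

def zipfian_mean(W, p):
    """Expectation of the word vectors under word probabilities p."""
    p = np.asarray(p, dtype=float)
    return (p / p.sum()) @ W

# Toy vocabulary (assumption): 10 frequent words near (+1, 0),
# 990 rare words near (-1, 0), with small Gaussian noise.
rng = np.random.default_rng(0)
n_frequent, n_rare = 10, 990
W = np.vstack([
    np.array([1.0, 0.0]) + 0.1 * rng.normal(size=(n_frequent, 2)),
    np.array([-1.0, 0.0]) + 0.1 * rng.normal(size=(n_rare, 2)),
])
p = 1.0 / np.arange(1, n_frequent + n_rare + 1)  # Zipf-like frequencies by rank
p /= p.sum()

apparent = uniform_mean(W)      # dominated by the many rare words
centroid = zipfian_mean(W, p)   # pulled toward the few frequent words
gap = np.linalg.norm(centroid - apparent)
```

Because the few frequent words carry a large share of the probability mass, `centroid` sits noticeably closer to the frequent cluster than `apparent` does, so centering with the unweighted mean removes the wrong point.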

Theorems & Definitions (8)

  • Definition 1: A random vector $\boldsymbol v\sim p$ on $\mathbb R^d$ has zero mean; the 1st moment of a symmetric random vector
  • Definition 2: A random vector $\boldsymbol v\sim p$ on $\mathbb R^d$ is in isotropic position around its barycenter; the 2nd moment of a symmetric random vector
  • Definition 3: Degree of centrality for the random vector $\boldsymbol{v} \sim p$; the 1st moment of symmetry
  • Definition 4: Degree of isotropy around the barycenter for the random vector $\boldsymbol{v} \sim p$; the 2nd moment of symmetry
  • Proposition 1
  • Theorem 1: The norm of a word vector learned with an empirical Zipfian prior model reflects the information content of the word; a refined version of Oyama2023-lp Eq. (12)
  • proof
  • proof
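
For a finite vocabulary, Definitions 1 and 2 can be written out explicitly. The following is a sketch under assumptions, not the paper's exact statement: $p(w)$ denotes the probability of word $w$, $\boldsymbol v_w$ its vector, and $\bar{\boldsymbol v}$ the barycenter (notation introduced here for illustration), and "isotropic position" is taken in its usual sense of a covariance proportional to the identity.

$$
\begin{aligned}
\text{zero mean:}\quad & \mathbb{E}_{w\sim p}[\boldsymbol v_w] = \sum_{w} p(w)\,\boldsymbol v_w = \boldsymbol 0,\\
\text{isotropic position:}\quad & \sum_{w} p(w)\,(\boldsymbol v_w - \bar{\boldsymbol v})(\boldsymbol v_w - \bar{\boldsymbol v})^{\top} \propto \boldsymbol I_d,
\qquad \bar{\boldsymbol v} = \sum_{w} p(w)\,\boldsymbol v_w .
\end{aligned}
$$

Setting $p(w)$ to the uniform distribution over the vocabulary gives the conventional unweighted conditions, while setting it to the empirical word frequency gives the Zipfian variant discussed in the abstract.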