On the Emergence of Linear Analogies in Word Embeddings

Daniel J. Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart

TL;DR

The paper addresses why word embeddings exhibit linear analogies by introducing a generative model in which each word is described by binary semantic attributes and co-occurrence statistics factorize into independent per-attribute interactions. It shows that the co-occurrence matrix $M$ has a Kronecker-product structure, so its eigenvectors are tensor products of per-attribute eigenvectors and its eigenvalues factorize across attributes; analogies then follow from simple attribute arithmetic. Linear analogies arise naturally in embeddings built from both $M$ and $\log M$, and they are robust to noise, vocabulary pruning, and even removal of all pairs forming a given relation. PMI-based targets (as used in GloVe) provide stronger and more stable analogy structure. Empirical validation on Wikipedia data agrees with the theory, indicating that the attribute-based spectral picture captures the essential mechanism behind linear analogies in word embeddings. The work thus offers a principled, analytically tractable account of analogy structure and of its dependence on embedding dimension and co-occurrence representation, with implications for the interpretation of modern language models.
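The Kronecker-product claim can be checked numerically on a toy case. The sketch below is illustrative only: the $2\times 2$ per-attribute block (entries $1+s$ for matching attribute values, $1-s$ otherwise), the helper names, and the three strengths are assumptions made for this example, not details stated in the summary. Under that assumption, each block has eigenpairs $(2, [1,1])$ and $(2s, [1,-1])$, so every tensor product of these vectors should be an eigenvector of $M$ with the product of the matching eigenvalues.

```python
from functools import reduce
from itertools import product

import numpy as np


def attribute_block(s):
    # ASSUMED per-attribute block: same attribute value -> 1 + s, opposite -> 1 - s.
    return np.array([[1.0 + s, 1.0 - s], [1.0 - s, 1.0 + s]])


def kronecker_cooccurrence(strengths):
    # M is the Kronecker product of one 2x2 block per binary attribute.
    return reduce(np.kron, [attribute_block(s) for s in strengths])


strengths = [0.6, 0.5, 0.4]  # hypothetical semantic strengths s_k, one per attribute
M = kronecker_cooccurrence(strengths)  # 2^3 x 2^3 matrix over all attribute combinations

# Eigenvectors of a single block (normalized).
flat = np.array([1.0, 1.0]) / np.sqrt(2)   # eigenvalue 2
flip = np.array([1.0, -1.0]) / np.sqrt(2)  # eigenvalue 2 * s

predicted = []
max_residual = 0.0
for choice in product([0, 1], repeat=len(strengths)):
    # Tensor product of per-attribute eigenvectors, product of per-attribute eigenvalues.
    vec = reduce(np.kron, [flip if c else flat for c in choice])
    val = np.prod([2.0 * s if c else 2.0 for c, s in zip(choice, strengths)])
    max_residual = max(max_residual, np.abs(M @ vec - val * vec).max())
    predicted.append(val)

numeric = np.linalg.eigvalsh(M)  # ascending eigenvalues from a generic solver
print(np.allclose(np.sort(predicted), numeric), max_residual)
```

The residual check does not depend on the sign or ordering conventions of the eigensolver, which is why the tensor-product vectors are built by hand rather than read off from `eigh`.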

Abstract

Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/(P(i)P(j))$, (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$ are included, their number setting the dimension of the embeddings, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
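The pipeline described in the abstract (co-occurrence counts, then $M$ or $\log M$, then top eigenvectors, then analogy completion) can be sketched end to end on synthetic data. Everything below is a minimal illustration under stated assumptions: the function names are hypothetical, the toy counts are generated from an assumed product form $\prod_k (1 + s_k a_i^k a_j^k)$ with attributes $a_i^k = \pm 1$, eigenpairs are ranked by absolute eigenvalue, and the query words are excluded from the candidate answers. None of these choices is confirmed by the text above as the authors' exact procedure.

```python
from itertools import product

import numpy as np


def cooccurrence_ratio(counts):
    """M(i,j) = P(i,j) / (P(i) P(j)) from a symmetric matrix of co-occurrence counts."""
    P = counts / counts.sum()
    p = P.sum(axis=1)  # marginal word probabilities
    return P / np.outer(p, p)


def spectral_embedding(target, K):
    """Word vectors (rows) built from the K eigenpairs of largest |eigenvalue|."""
    vals, vecs = np.linalg.eigh(target)
    top = np.argsort(-np.abs(vals))[:K]
    # ASSUMED weighting: scale each eigenvector by sqrt(|eigenvalue|).
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))


def complete_analogy(W, a, b, c):
    """Index d closest to W[a] - W[b] + W[c], excluding the three query words."""
    query = W[a] - W[b] + W[c]
    dist = np.linalg.norm(W - query, axis=1)
    dist[[a, b, c]] = np.inf
    return int(np.argmin(dist))


# Synthetic toy data (NOT measured statistics): 8 "words", one per combination
# of 3 binary attributes, with hypothetical strengths s_k.
attrs = np.array(list(product([1, -1], repeat=3)))  # shape (8, 3), entries +/-1
strengths = np.array([0.6, 0.5, 0.4])
counts = np.prod(1 + strengths * attrs[:, None, :] * attrs[None, :, :], axis=2)

M = cooccurrence_ratio(counts)
W_M = spectral_embedding(M, K=4)            # K = 1 + number of attributes
W_log = spectral_embedding(np.log(M), K=4)

# Words 0 and 4 differ only in the first attribute; so do words 1 and 5.
answer_M = complete_analogy(W_M, 4, 0, 1)
answer_log = complete_analogy(W_log, 4, 0, 1)
print(answer_M, answer_log)
```

With $K = 1 + d$ every retained embedding coordinate in this toy is either constant or proportional to a single attribute, so the parallelogram relation is exact for both targets; real corpora only satisfy it approximately.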

Paper Structure

This paper contains 3 sections, 59 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: (a) A subset of the co-occurrence matrix for Wikipedia data, with labels drawn from the categories "country--capitals" and "noun--plural". (b) A subset of the co-occurrence matrix $M$ generated by our model ($d=8$, $s_k \sim \mathcal{N}(1/2,\sigma_s)$ with $\sigma_s = 10^{-3}$). Colors indicate the value of $M_{ij}$ on a log scale. (c) Averaged eigenvalue spectrum of the co-occurrence matrix $M$ for $d=8$, obtained from 50 random realizations of the semantic strengths for $\sigma_s = 10^{-3}$, $2\times 10^{-2}$, and $10^{-1}$, for uniform $s_k \in (0,1)$, and for empirical co-occurrence data. The inset shows that the spectrum of the $\text{PMI}$ matrix is not sharply peaked around a set of nearly identical eigenvalues, as assumed in some previous works.
  • Figure 2: Analogy completion accuracy emerges at low embedding dimension. (a): Wikipedia analogy completion accuracy by analogy category for representations constructed from co-occurrences $M_{ij}$ with a vocabulary of 10,000 words. (b): Analogy accuracy for Wikipedia text, with different matrix targets $M_{ij}$ and $\log(M_{ij}+\varepsilon_R)$ (regularizer $\varepsilon_R=10^{-2}$). Shaded area indicates the sample standard deviation across analogy categories. (c): Analogy completion accuracy for a single realization of the model for $d=8$, for matrix target $M_{ij}$. (d): Analogy accuracy for the model with different matrix targets, averaged across all analogy tasks and 50 realizations of the $s_k$. Shading indicates the standard deviation between realizations. (e): Analogy accuracy when multiplicative noise is applied to each entry of the co-occurrence matrix, $P_{ij} \rightarrow P_{ij} \exp(\xi_{ij})$ for symmetric $\xi_{ij} \sim \mathcal{N}(0,\sigma_{\xi})$, averaged over 10 realizations of $s_k$ and noise $\xi$. (f): Analogy accuracy after both sparsifying the vocabulary and including multiplicative noise ($\sigma_\xi = 10^{-1}$), retaining only a fraction $f = 0.15$ of words in $d=12$.
  • Figure 3: (a) Performance of the $\log(M)$ Wikipedia text co-occurrence matrix for analogy tasks. (b) As in (a), but having pruned from the co-occurrence matrix all co-occurrences of pairs matching the indicated analogy. The average, in black, reports the analogy accuracy when all analogies are pruned from the corpus. (c) Pruning of analogies from the co-occurrence matrix in the symmetric synthetic model over ten realizations of $s_k\in(0,1)$ disorder for $d=8$. Solid curves represent the average analogy performance on analogies involving the unpruned dimension, while the dashed curve reports performance on analogies that involve the dimension affected by pruning.
  • Figure 4: Analogy performance for narrowly distributed $s_k$ for different matrix targets in $d=8$. Shading represents the standard deviation across 50 replicates. Vertical lines at $K_1 = 1+ \binom{d}{1}$, $K_2 = K_1+\binom{d}{2}$ and $K_3=K_2+\binom{d}{3}$ mark the complete inclusion of the different eigenbands.
  • Figure 5: Analogy performance for uniformly distributed $s_k$ in different dimensions. Shading represents the standard deviation across 50 replicates. Vertical lines mark $K = 5$, $K=6$, $K=7$, and $K=8$, corresponding to the dimension of the semantic embedding space for the main curves.
  • ...and 4 more figures
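Two quantities defined in the captions above can be made concrete with a short sketch. The first function accumulates the eigenband boundaries of Figure 4, $K_m = \sum_{j \le m} \binom{d}{j}$, assuming (as the binomial counts in that caption suggest) that the band of order $j$ holds $\binom{d}{j}$ eigenvalues. The second applies the symmetric multiplicative noise of Figure 2(e), $P_{ij} \rightarrow P_{ij}\exp(\xi_{ij})$. The function names are hypothetical, and treating $\sigma_\xi$ as a standard deviation rather than a variance is an assumption, since the caption does not say which is meant.

```python
from math import comb

import numpy as np


def eigenband_boundaries(d, max_order):
    """Cumulative counts K_0, K_1, ..., K_max_order with K_m = sum_{j<=m} C(d, j)."""
    boundaries = []
    total = 0
    for order in range(max_order + 1):
        total += comb(d, order)  # ASSUMED size of the band of this order
        boundaries.append(total)
    return boundaries


def symmetric_multiplicative_noise(P, sigma_xi, rng):
    """Return P_ij * exp(xi_ij) with xi symmetric and Gaussian.

    ASSUMPTION: sigma_xi is used as the standard deviation of xi.
    """
    n = P.shape[0]
    xi = rng.normal(0.0, sigma_xi, size=(n, n))
    xi = np.triu(xi) + np.triu(xi, k=1).T  # mirror the upper triangle so xi_ij == xi_ji
    return P * np.exp(xi)


# K_1, K_2, K_3 for d = 8, the dimension used in Figure 4.
print(eigenband_boundaries(8, 3))  # [1, 9, 37, 93]

# Noise applied to a small symmetric placeholder matrix (not real data).
rng = np.random.default_rng(0)
P_toy = np.full((4, 4), 1.0 / 16.0)
P_noisy = symmetric_multiplicative_noise(P_toy, sigma_xi=0.1, rng=rng)
```

Because the noise multiplies each entry by a positive factor, the perturbed matrix stays positive and symmetric, so $\log$ targets remain well defined after the perturbation.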

Theorems & Definitions (1)

  • proof