On the Emergence of Linear Analogies in Word Embeddings

Daniel J. Korchinski; Dhruva Karkada; Yasaman Bahri; Matthieu Wyart

TL;DR

The paper asks why word embeddings exhibit linear analogies. It introduces a generative model in which each word is described by binary semantic attributes and co-occurrence statistics factor into independent per-attribute interactions. Under this model the co-occurrence matrix $M$ has a Kronecker-product structure: its eigenvectors are tensor products of per-attribute eigenvectors and its eigenvalues factorize across attributes, so analogies follow from simple attribute arithmetic. Linear analogies arise in embeddings built from both $M$ and $\log M$; they are robust to noise, vocabulary pruning, and even removal of all pairs forming a given relation, and PMI-based targets (as used in GloVe) give a stronger and more stable analogy structure. Co-occurrence statistics measured on Wikipedia agree with the theory, which suggests that the attribute-based spectral picture captures the essential mechanism behind linear analogies in word embeddings. The work thus offers a principled, analytically tractable account of analogy structure and of its dependence on embedding dimension and co-occurrence representation, with implications for interpreting modern language models.
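The Kronecker-product claim can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: the vocabulary contains one word per combination of `d = 3` binary attributes, each attribute contributes a 2x2 block with entries `1 + s` (words agree) or `1 - s` (words differ), and the strengths in `strengths` are hypothetical values chosen to be distinct. This parameterization is a convenient stand-in and is not necessarily the one used in the paper.

```python
import numpy as np
from functools import reduce

def attribute_block(s):
    """2x2 interaction for one binary attribute (assumed form):
    1 + s if two words agree on the attribute, 1 - s if they differ."""
    return np.array([[1.0 + s, 1.0 - s],
                     [1.0 - s, 1.0 + s]])

# Hypothetical attribute strengths, distinct so the spectrum is non-degenerate.
strengths = [0.5, 0.3, 0.2]
d = len(strengths)

# One word per attribute combination: M is the Kronecker product of the blocks.
M = reduce(np.kron, [attribute_block(s) for s in strengths])  # shape (8, 8)

# Each block has eigenvalues 2 and 2*s, so the spectrum of M should be the
# set of all products of one eigenvalue per attribute.
block_eigs = [np.array([2.0, 2.0 * s]) for s in strengths]
predicted = np.sort(reduce(np.kron, block_eigs))
vals, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
spectrum_matches = np.allclose(vals, predicted)

# Embed each word with the top d + 1 eigenvectors (the constant mode plus one
# mode per attribute), scaled by the square root of the eigenvalue.
top = np.argsort(vals)[::-1][: d + 1]
W = vecs[:, top] * np.sqrt(vals[top])

# Word index bits encode the attributes. Words a, b, c, e differ only in the
# first two attributes, so they should form a parallelogram in the embedding.
a, b, c, e = 0b000, 0b100, 0b010, 0b110
residual = np.linalg.norm(W[b] - W[a] + W[c] - W[e])

print("spectrum factorizes:", spectrum_matches)
print("parallelogram residual:", residual)
```

In this toy setting the top `d + 1` eigenvectors are the constant vector and the three attribute vectors, which is why `W[b] - W[a] + W[c]` lands on `W[e]` up to floating-point error.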

Abstract

Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/(P(i)P(j))$, (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$ are included, where the number of eigenvectors sets the dimension of the embeddings, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
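The pipeline described in the abstract, from co-occurrence counts to $M(i,j)$ to a spectral embedding to an analogy query, can be sketched as follows. Everything concrete here is an assumption made for illustration: the eight-word vocabulary, the three attributes (royal, male, adult), the strengths in `strengths`, the idealized noise-free `counts`, and the helper names `ratio_matrix`, `spectral_embedding`, and `solve_analogy`. The analogy query uses the common cosine-similarity rule that excludes the three query words; it is not taken from the paper's code.

```python
import numpy as np

def ratio_matrix(counts):
    """M(i,j) = P(i,j) / (P(i) P(j)) from a symmetric co-occurrence count matrix."""
    P = counts / counts.sum()
    p = P.sum(axis=1)  # marginal word probabilities P(i)
    return P / np.outer(p, p)

def spectral_embedding(M, k):
    """Rows are word vectors built from the top-k eigenvectors of M,
    each scaled by the square root of its (non-negative) eigenvalue."""
    vals, vecs = np.linalg.eigh(M)
    top = np.argsort(vals)[::-1][:k]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def solve_analogy(W, a, b, c):
    """Index of the word closest (cosine) to W[b] - W[a] + W[c], excluding a, b, c."""
    target = W[b] - W[a] + W[c]
    norms = np.linalg.norm(W, axis=1) * np.linalg.norm(target) + 1e-12
    sims = (W @ target) / norms
    sims[[a, b, c]] = -np.inf
    return int(np.argmax(sims))

# Hypothetical vocabulary with attributes (royal, male, adult) coded as +1 / -1.
vocab = ["king", "prince", "queen", "princess", "man", "boy", "woman", "girl"]
attrs = np.array([[+1, +1, +1], [+1, +1, -1], [+1, -1, +1], [+1, -1, -1],
                  [-1, +1, +1], [-1, +1, -1], [-1, -1, +1], [-1, -1, -1]])
strengths = [0.5, 0.3, 0.2]  # assumed attribute strengths

# Assumed attribute-based interaction: a product of independent per-attribute factors.
M = np.prod([1.0 + s * np.outer(attrs[:, k], attrs[:, k])
             for k, s in enumerate(strengths)], axis=0)

# Idealized expected counts (no sampling noise) for a corpus of 10^6 word pairs.
counts = 1e6 * M / M.sum()

W = spectral_embedding(ratio_matrix(counts), k=4)
i = {w: n for n, w in enumerate(vocab)}
answer = solve_analogy(W, i["man"], i["king"], i["woman"])
print(vocab[answer])
```

With real corpora the counts are noisy and the vocabulary does not cover every attribute combination, so this sketch only shows the mechanics of the computation rather than the robustness results the paper reports.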

Paper Structure

Table of Contents

Figures (9)

Theorems & Definitions (1)