Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models

Dhruva Karkada, James B. Simon, Yasaman Bahri, Michael R. DeWeese

TL;DR

The paper presents quadratic word embedding models (QWEMs) as a tractable proxy for word2vec, obtained from the quartic Maclaurin approximation of the skip-gram with negative sampling (SGNS) loss. It derives a closed-form gradient-flow solution showing that training proceeds in sequential, rank-incrementing steps, each of which learns an orthogonal subspace characterized by the top eigen-directions of the target ${\bm{M}}^{*}$, with explicit timescales ${\tau_k}$. Empirical validation on a Wikipedia corpus shows that QWEMs reproduce word2vec's training dynamics, learned features, and downstream analogy performance, and the authors connect the formation of linear semantic representations to random-matrix theory (spiked models and Marchenko-Pastur spectra). The work provides a predictive, interpretable theory of feature learning in self-supervised language models and suggests that linear semantic structure emerges early as a consequence of the optimization dynamics and data statistics.

Abstract

Self-supervised word embedding algorithms such as word2vec provide a minimal setting for studying representation learning in language modeling. We examine the quartic Taylor approximation of the word2vec loss around the origin, and we show that both the resulting training dynamics and the final performance on downstream tasks are empirically very similar to those of word2vec. Our main contribution is to analytically solve for both the gradient flow training dynamics and the final word embeddings in terms of only the corpus statistics and training hyperparameters. The solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on Wikipedia, we find that each of the top linear subspaces represents an interpretable topic-level concept. Finally, we apply our theory to describe how linear representations of more abstract semantic concepts emerge during training; these can be used to complete analogies via vector addition.
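The stepwise, rank-incrementing dynamics described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' code. It assumes a loss of the form $\frac{1}{4}\lVert M - W^\top W\rVert_F^2$, a synthetic symmetric target `M` with three hand-picked eigenvalues, and plain Euler steps standing in for gradient flow. The function name `simulate_stepwise_learning` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_stepwise_learning(eigvals=(4.0, 2.0, 1.0), n=6, d=3, lr=0.01,
                               steps=4000, init_scale=1e-6, seed=0):
    """Euler-discretized gradient flow on 1/4 * ||M - W^T W||_F^2 from small init.

    Assumption: M is a synthetic symmetric target, not corpus statistics.
    Returns the target, the final W, and the top-d eigenvalues of W^T W per step.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal eigenbasis with a few well-separated positive eigenvalues.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    lam = np.zeros(n)
    lam[:len(eigvals)] = eigvals
    M = (Q * lam) @ Q.T  # equals Q @ diag(lam) @ Q.T

    W = init_scale * rng.standard_normal((d, n))
    history = []
    for _ in range(steps):
        # The gradient of the assumed loss w.r.t. W is -W (M - W^T W).
        W = W + lr * W @ (M - W.T @ W)
        # Track the d largest eigenvalues of W^T W, in descending order.
        history.append(np.linalg.eigvalsh(W.T @ W)[::-1][:d])
    return M, W, np.array(history)

if __name__ == "__main__":
    _, _, hist = simulate_stepwise_learning()
    for k in range(hist.shape[1]):
        # First step at which mode k exceeds half of its final value.
        t_half = int(np.argmax(hist[:, k] > 0.5 * hist[-1, k]))
        print(f"mode {k}: final value {hist[-1, k]:.3f}, half-learned at step {t_half}")
```

In this toy setting each eigen-direction of `M` is picked up in turn, with larger eigenvalues learned earlier, so the effective rank of `W` grows one step at a time.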

Paper Structure

This paper contains 38 sections, 5 theorems, 29 equations, 9 figures.

Key Result

theorem 1

Under \ref{asm:reweight}, the contrastive loss \ref{eq:qwem_loss} can be rewritten as the unweighted matrix factorization problem … If ${\bm{\Lambda}}_{[:d,:d]}$ is positive semidefinite, then the set of global minima of $\mathcal{L}$ is given by …
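As a rough illustration of what an unweighted matrix factorization solution looks like, the sketch below builds a minimizer from a truncated eigendecomposition. It is an assumption-laden stand-in rather than the theorem's exact statement: the loss is taken to be $\frac{1}{4}\lVert M - W^\top W\rVert_F^2$ for a generic symmetric `M`, and the helper names `top_d_factor` and `qwem_like_loss` are hypothetical. It relies on the standard fact that the best rank-$d$ positive semidefinite approximation of a symmetric matrix keeps its $d$ largest eigenvalues, provided those are nonnegative.

```python
import numpy as np

def top_d_factor(M, d):
    """Return a d x n matrix W whose Gram matrix W^T W keeps the top-d eigenpairs of M.

    Assumption: M is symmetric and its d largest eigenvalues are nonnegative.
    """
    evals, evecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:d]       # indices of the d largest eigenvalues
    lam, V = evals[order], evecs[:, order]
    if np.any(lam < 0):
        raise ValueError("the top-d eigenvalues must be nonnegative")
    # Rows of W are the top eigenvectors scaled by sqrt(eigenvalue).
    return np.sqrt(lam)[:, None] * V.T

def qwem_like_loss(W, M):
    """Assumed loss form: 1/4 * ||M - W^T W||_F^2."""
    return 0.25 * np.linalg.norm(M - W.T @ W, "fro") ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 5))
    M = A @ A.T                               # synthetic PSD target
    W = top_d_factor(M, 3)
    U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    print("loss at W:  ", qwem_like_loss(W, M))
    print("loss at U W:", qwem_like_loss(U @ W, M))
```

Left-multiplying `W` by any orthogonal matrix leaves `W.T @ W` unchanged, so under this assumed loss the minimizers form a set related by rotations rather than a single point.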

Figures (9)

  • Figure 1: Quadratic word embedding models are a faithful and analytically solvable proxy for word2vec. We compare the time course of learning in QWEMs (top) and word2vec (bottom), finding striking similarities in their training dynamics and learned representations. Analytically, we solve for the optimization dynamics of QWEMs under gradient flow from small initialization, revealing discrete, rank-incrementing learning steps corresponding to stepwise decreases in the loss (top left). In latent space (right side plots), embedding vectors expand into subspaces of increasing dimension at each learning step. These PCA directions are the model's learned features, and they can be extracted from our theory in closed form given only the corpus statistics and hyperparameters (\ref{thm:matrixfac}). Empirically, QWEMs yield high-quality embeddings very similar to word2vec's in terms of their learned features and performance on benchmarks (\ref{fig:comparisons}). See \ref{appdx:experiments} for details.
  • Figure 2: Theory matches experiment. We make two simplifications to the word2vec algorithm: a quartic approximation of the loss, and a restriction on the reweighting hyperparameters. We train these QWEMs on 2 billion tokens of English Wikipedia (see \ref{appdx:experiments} for details) and compare to word2vec. We find good qualitative match in the singular value dynamics, both with the standard word2vec initialization scheme and with small random initialization. (For evidence that the singular vectors match as well, see \ref{fig:comparisons}.) We compare the dynamics to the prediction of \ref{thm:sri}, which is derived in the vanishing initialization limit with full-batch gradient flow. Even though the experiment uses stochastic mini-batching, non-vanishing learning rate, and large initialization, we find excellent agreement even up to constant factors.
  • Figure 3: Words with smallest cosine distance to embedding principal components
  • Figure 4: Performance on standard word embedding benchmarks
  • Figure 6: Models build linear representations from a few informative and many noisy eigen-features. In the left and upper plots, we examine task vectors between verb past tenses and their participle (e.g., $\mathbf{went}-\mathbf{going}$). In \ref{appdx:more} we show that these observations hold for other semantic binaries. (Left.) The spectrum of the Gram matrix (histogram) is well-described by a Marchenko-Pastur distribution (orange) plus an outlier "spike," across model sizes $d$. See \ref{appdx:taskvecs} for details. (Top.) The spike corresponds to the average task vector, which comprises a few dominant eigen-features. Many of these features correspond to concepts related to history or temporal change, consistent with this semantic category. (Bottom.) We measure the strength of the spike across model size $d$ for various semantic categories. We find that the spike strength correlates strongly with the model's ability to use the task vectors for analogy completion.
  • ...and 4 more figures
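The "Marchenko-Pastur bulk plus outlier spike" picture in the Figure 6 caption can be reproduced qualitatively on synthetic data. The sketch below is only an illustration under stated assumptions: the "task vectors" are a shared direction plus isotropic Gaussian noise, the Gram matrix is normalized by the dimension, and the name `spiked_gram_demo` and all parameter values are hypothetical. It does not use the paper's embeddings or its exact normalization.

```python
import numpy as np

def spiked_gram_demo(n_vectors=200, dim=800, signal=0.5, noise_std=1.0, seed=0):
    """Spectrum of a Gram matrix built from synthetic 'task vectors'.

    Assumption: each vector is a shared direction u (scaled by signal * sqrt(dim))
    plus i.i.d. Gaussian noise with standard deviation noise_std per coordinate.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(dim)
    u /= np.linalg.norm(u)                       # unit-norm shared direction

    Z = noise_std * rng.standard_normal((n_vectors, dim))
    X = signal * np.sqrt(dim) * u + Z            # broadcasting adds u to every row

    G = X @ X.T / dim                            # n_vectors x n_vectors Gram matrix
    evals = np.linalg.eigvalsh(G)[::-1]          # eigenvalues in descending order

    # Upper edge of the Marchenko-Pastur bulk for pure noise with ratio n/dim.
    gamma = n_vectors / dim
    bulk_edge = noise_std ** 2 * (1.0 + np.sqrt(gamma)) ** 2

    # Cosine between the average vector and the planted direction.
    mean_vec = X.mean(axis=0)
    cosine = float(mean_vec @ u / np.linalg.norm(mean_vec))
    return evals, bulk_edge, cosine

if __name__ == "__main__":
    evals, edge, cosine = spiked_gram_demo()
    print(f"top eigenvalue: {evals[0]:.2f}, second: {evals[1]:.2f}, bulk edge: {edge:.2f}")
    print(f"cosine(mean vector, planted direction): {cosine:.3f}")
```

In this toy model a single eigenvalue separates from the noise bulk, and the average of the vectors aligns closely with the planted direction, which mirrors the caption's statement that the spike corresponds to the average task vector.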

Theorems & Definitions (5)

  • theorem 1: QWEM = unweighted matrix factorization
  • proposition 1
  • lemma 1: Training dynamics, aligned initialization
  • theorem 1: QWEM = unweighted matrix factorization
  • proposition 1