word2vec Parameter Learning Explained

Xin Rong

TL;DR

This note provides detailed derivations of the parameter update rules for the CBOW and Skip-Gram formulations of word2vec, including how gradients flow through input and output vector representations. It also covers optimization techniques for efficiency, notably hierarchical softmax and negative sampling, with explicit update rules and intuition for backpropagation. The explanations tie the math to how word co-occurrence guides vector movements and include an interactive visualization (wevi) to aid intuition. Together, these contributions clarify the learning dynamics behind word embeddings and practical ways to scale training.

Abstract

The word2vec model and application by Mikolov et al. have attracted a great deal of attention over the past two years. The vector representations of words learned by word2vec models have been shown to carry semantic meaning and are useful in various NLP tasks. As an increasing number of researchers would like to experiment with word2vec or similar techniques, I notice that there is a lack of material that comprehensively explains the parameter learning process of word embedding models in detail, which prevents researchers who are not experts in neural networks from understanding the working mechanism of such models. This note provides detailed derivations and explanations of the parameter update equations of the word2vec models, including the original continuous bag-of-word (CBOW) and skip-gram (SG) models, as well as advanced optimization techniques, including hierarchical softmax and negative sampling. Intuitive interpretations of the gradient equations are also provided alongside mathematical derivations. In the appendix, a review of the basics of neural networks and backpropagation is provided. I also created an interactive demo, wevi, to facilitate the intuitive understanding of the model.

Paper Structure

This paper contains 11 sections, 69 equations, 7 figures.

Figures (7)

  • Figure 1: A simple CBOW model with only one word in the context
  • Figure 2: Continuous bag-of-word model
  • Figure 3: The skip-gram model.
  • Figure 4: An example binary tree for the hierarchical softmax model. The white units are words in the vocabulary, and the dark units are inner units. An example path from root to $w_2$ is highlighted. In the example shown, the length of the path $L(w_2) = 4$. $n(w,j)$ means the $j$-th unit on the path from root to the word $w$.
  • Figure 5: An artificial neuron
  • ...and 2 more figures
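
The path notation in the Figure 4 caption can be made concrete with a small sketch. The tree below is hypothetical (it is not the tree drawn in the figure) and is chosen only so that the path to $w_2$ has length 4. It assumes, as the caption's $L(w_2) = 4$ suggests, that the path length counts both the root and the word itself. The names `PARENT`, `path_from_root`, `n`, and `L` are illustrative, not taken from the paper.

```python
# Hypothetical binary tree: each unit maps to its parent; the root maps to None.
# Leaves (w1..w4) are vocabulary words, the other units are inner units.
PARENT = {
    "root": None,
    "inner_a": "root",
    "inner_b": "inner_a",
    "w1": "inner_b",
    "w2": "inner_b",
    "w3": "inner_a",
    "w4": "root",
}


def path_from_root(word):
    """Return the units on the path from the root to `word`, root first."""
    path = []
    unit = word
    while unit is not None:
        path.append(unit)
        unit = PARENT[unit]
    path.reverse()
    return path


def n(word, j):
    """The j-th unit (1-indexed) on the path from the root to `word`."""
    return path_from_root(word)[j - 1]


def L(word):
    """Number of units on the path, counting both the root and the word."""
    return len(path_from_root(word))


if __name__ == "__main__":
    print(path_from_root("w2"))
    print(L("w2"), n("w2", 1), n("w2", L("w2")))
```

In this sketch, `n("w2", 1)` is the root and `n("w2", L("w2"))` is the word itself; the units in between are the inner units on the path to that word.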