On SkipGram Word Embedding Models with Negative Sampling: Unified Framework and Impact of Noise Distributions
Dezhi Liu, Richong Zhang, Ziqiao Wang
TL;DR
The paper addresses theoretical and practical gaps in negative sampling for word embeddings by proposing Word-Context Classification (WCC), a framework that generalizes SGN to arbitrary noise distributions. It develops adaptive noise mechanisms (caSGN) and ACE, derives PMI-based interpretations under certain conditions, and shows experimentally that noise distributions aligned with the data distribution yield better embeddings and faster convergence. The key contributions are a unifying theory for SGN-like models, a spectrum of noise-distribution variants, empirical evidence that data-distribution noise improves both performance and training time, and novel models that outperform existing WCC variants. These insights offer a principled path toward adaptable embedding learning and inform future work on Transformer-based or multimodal extensions within the WCC paradigm.
Abstract
SkipGram word embedding models with negative sampling, or SGN for short, form an elegant family of word embedding models. In this paper, we formulate a framework for word embedding, referred to as Word-Context Classification (WCC), that generalizes SGN to a wide family of models. The framework, which uses some "noise examples", is justified through theoretical analysis. The impact of the noise distribution on the learning of WCC embedding models is studied experimentally, suggesting that the best noise distribution is, in fact, the data distribution, in terms of both embedding performance and speed of convergence during training. Along the way, we discover several novel embedding models that outperform existing WCC models.
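
To make the role of the "noise examples" concrete, here is a minimal sketch of the usual SGN-style logistic loss, with the noise distribution left as a free parameter. It is an illustration under stated assumptions, not the authors' implementation: the function names (`log_sigmoid`, `wcc_pair_loss`, `sample_noise`) and the toy vectors are hypothetical, and the loss shown is the standard negative-sampling objective rather than any specific variant from the paper.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def wcc_pair_loss(word_vec, context_vec, noise_context_vecs):
    """Loss for one observed (word, context) pair plus its noise pairs.

    The observed pair is treated as a positive example and every
    sampled noise context as a negative example of a binary classifier.
    """
    loss = -log_sigmoid(dot(word_vec, context_vec))
    for noise_vec in noise_context_vecs:
        loss -= log_sigmoid(-dot(word_vec, noise_vec))
    return loss


def sample_noise(noise_probs, k, rng):
    """Draw k context indices from an arbitrary noise distribution.

    noise_probs is any list of non-negative weights over the context
    vocabulary (assumption: it is supplied by the caller).
    """
    indices = range(len(noise_probs))
    return rng.choices(indices, weights=noise_probs, k=k)


# Toy usage: a 3-word context vocabulary with 2-dimensional embeddings.
context_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)
noise_ids = sample_noise([0.2, 0.3, 0.5], k=2, rng=rng)
loss = wcc_pair_loss([0.5, -0.5], context_vecs[0],
                     [context_vecs[i] for i in noise_ids])
```

Changing the weights passed to `sample_noise` is the only thing that differs between noise distributions in this sketch, which is the knob the abstract's experiments vary.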
