Statistical Uncertainty in Word Embeddings: GloVe-V
Andrea Vallebueno, Cassandra Handan-Nader, Christopher D. Manning, Daniel E. Ho
TL;DR
This paper introduces GloVe-V, a scalable method for quantifying reconstruction-error uncertainty in GloVe word embeddings. It casts the embedding optimization as a probabilistic problem and derives a multivariate normal distribution over word vectors. The approach yields per-word variance estimates and a principled route, via the delta method or resampling, to propagate that uncertainty into downstream statistics such as cosine similarity, neighbor rankings, model performance, and bias measures. Key contributions include the statistical foundations for reconstruction-error variance in GloVe, demonstrations of how accounting for variance changes conclusions in word similarity, model comparison, and bias analyses, and a public release of pre-computed embeddings and variances. The method is computationally efficient relative to a document bootstrap and provides a coherent framework for significance testing in NLP, though it is limited to words with sufficient co-occurrence context and is currently tailored to the GloVe model. The authors also acknowledge potential extensions to transformer-based representations and to other sources of uncertainty, and outline these as avenues for future work.
Abstract
Static word embeddings are ubiquitous in computational social science applications and contribute to practical decision-making in a variety of fields including law and healthcare. However, assessing the statistical uncertainty in downstream conclusions drawn from word embedding statistics has remained challenging. When using only point estimates for embeddings, researchers have no streamlined way of assessing the degree to which their model selection criteria or scientific conclusions are subject to noise due to sparsity in the underlying data used to generate the embeddings. We introduce a method to obtain approximate, easy-to-use, and scalable reconstruction error variance estimates for GloVe (Pennington et al., 2014), one of the most widely used word embedding models, using an analytical approximation to a multivariate normal model. To demonstrate the value of embeddings with variance (GloVe-V), we illustrate how our approach enables principled hypothesis testing in core word embedding tasks, such as comparing the similarity between different word pairs in vector space, assessing the performance of different models, and analyzing the relative degree of ethnic or gender bias in a corpus using different word lists.
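As an illustration of the resampling route, the sketch below draws word vectors from per-word multivariate normal distributions and reports a percentile interval for the cosine similarity of one word pair. It is a minimal sketch and not the authors' released code: the function names `cosine` and `cosine_interval`, the 5-dimensional toy means, and the diagonal toy covariances are all hypothetical, and the two words are assumed to be sampled independently.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_interval(mu_a, cov_a, mu_b, cov_b, n_draws=2000, level=0.95, seed=0):
    """Point estimate and percentile interval for cosine similarity.

    Assumes each word vector follows N(mu, cov) and that the two words
    are drawn independently (an assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    draws_a = rng.multivariate_normal(mu_a, cov_a, size=n_draws)
    draws_b = rng.multivariate_normal(mu_b, cov_b, size=n_draws)

    # Row-wise cosine similarity between paired draws.
    num = np.einsum("ij,ij->i", draws_a, draws_b)
    den = np.linalg.norm(draws_a, axis=1) * np.linalg.norm(draws_b, axis=1)
    sims = num / den

    alpha = 1.0 - level
    lo, hi = np.quantile(sims, [alpha / 2, 1.0 - alpha / 2])
    return cosine(mu_a, mu_b), float(lo), float(hi)


# Toy inputs (hypothetical values, not taken from the paper's data release).
mu_a = np.array([0.8, 0.1, -0.3, 0.5, 0.2])
mu_b = np.array([0.6, 0.3, -0.1, 0.4, -0.2])
cov_a = 0.01 * np.eye(5)  # a frequent word: small variance
cov_b = 0.10 * np.eye(5)  # a rarer word: larger variance

point, lo, hi = cosine_interval(mu_a, cov_a, mu_b, cov_b)
print(f"cosine = {point:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

Comparing two word pairs then amounts to computing the difference of their sampled similarities on each draw and checking whether the resulting interval excludes zero.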
