Statistical Uncertainty in Word Embeddings: GloVe-V

Andrea Vallebueno, Cassandra Handan-Nader, Christopher D. Manning, Daniel E. Ho

TL;DR

This paper introduces GloVe-V, a scalable method for quantifying reconstruction error uncertainty in GloVe word embeddings. It treats the embedding optimization as a probabilistic problem and derives a multivariate normal distribution over each word vector. The approach yields per-word variance estimates and a principled route, via the delta method or resampling, for propagating that uncertainty into downstream statistics such as cosine similarity, neighbor rankings, model performance, and bias measures. Key contributions include the statistical foundations for reconstruction error variance in GloVe; demonstrations of how accounting for variance alters conclusions in word similarity, model comparison, and bias analyses; and a public data release with pre-computed embeddings and variances. The method is computationally efficient relative to the document bootstrap and provides a coherent framework for significance testing in NLP, though it is limited to words with sufficient co-occurrence context and is currently tailored to the GloVe model. The authors also note potential extensions to transformer-based representations and to other sources of uncertainty as avenues for future work.
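
To make the delta-method route concrete, the following is a minimal sketch of a first-order standard error for the cosine similarity between two words. It is not the authors' released code: the function names are hypothetical, each word is assumed to come with a point-estimate vector and a covariance matrix, and the two words are assumed to be independent so that their variance contributions add.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_delta_se(u, v, sigma_u, sigma_v):
    """First-order (delta-method) standard error of cos(u, v).

    Assumption for this sketch: u and v are independent with covariance
    matrices sigma_u and sigma_v, so the variance is the sum of two
    quadratic forms in the gradients of the cosine similarity.
    """
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    cos = u @ v / (norm_u * norm_v)
    # Gradients of cos(u, v) with respect to u and v.
    grad_u = v / (norm_u * norm_v) - cos * u / norm_u**2
    grad_v = u / (norm_u * norm_v) - cos * v / norm_v**2
    var = grad_u @ sigma_u @ grad_u + grad_v @ sigma_v @ grad_v
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(var, 0.0)))


if __name__ == "__main__":
    # Toy vectors and covariances, purely illustrative.
    u = np.array([1.0, 0.0])
    v = np.array([0.0, 1.0])
    sigma = 0.01 * np.eye(2)
    print(cosine_similarity(u, v), cosine_delta_se(u, v, sigma, sigma))
```

A larger covariance for either word widens the standard error, which is how sparsely observed words end up with less certain similarity estimates.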

Abstract

Static word embeddings are ubiquitous in computational social science applications and contribute to practical decision-making in a variety of fields including law and healthcare. However, assessing the statistical uncertainty in downstream conclusions drawn from word embedding statistics has remained challenging. When using only point estimates for embeddings, researchers have no streamlined way of assessing the degree to which their model selection criteria or scientific conclusions are subject to noise due to sparsity in the underlying data used to generate the embeddings. We introduce a method to obtain approximate, easy-to-use, and scalable reconstruction error variance estimates for GloVe (Pennington et al., 2014), one of the most widely used word embedding models, using an analytical approximation to a multivariate normal model. To demonstrate the value of embeddings with variance (GloVe-V), we illustrate how our approach enables principled hypothesis testing in core word embedding tasks, such as comparing the similarity between different word pairs in vector space, assessing the performance of different models, and analyzing the relative degree of ethnic or gender bias in a corpus using different word lists.

Paper Structure (25 sections, 22 equations, 9 figures)

Figures (9)

  • Figure 1: Conceptual diagram of the GloVe-V method for one word. The top two rows illustrate the structural form and estimation of the original GloVe model (Pennington et al., 2014), which models each row of a logged, weighted co-occurrence matrix as the product of a word vector and context vectors, plus constant terms. As shown in the third row, GloVe-V creates a distribution for the optimal GloVe word vector using the reconstruction error found through the GloVe minimization procedure. These distributions can be efficiently computed word-by-word by assuming conditional independence between words given the optimal context vectors and constants.
  • Figure 2: Uncertainty in word embedding locations. Two-dimensional representations of GloVe word embeddings trained on COHA (1900--1999), along with ellipses drawn around 100 draws from the estimated multivariate normal distribution from Equation \ref{eq:prob} for a random subset of words. Lower frequency words like "rigs" and "illumination" have more uncertainty in their estimated positions in the vector space than high frequency words like "she" and "large."
  • Figure 3: Word-level relationship between GloVe-V variances and frequency on COHA (1900--1999). L2-norm of the diagonal of $\hat{\mathbf{\Sigma}}$ from Equation \ref{eq:sigma} ($x$-axis, on a $\log_{10}$ scale) plotted against logged word frequencies ($y$-axis, on a $\log_{10}$ scale) for a subset of 5,000 words randomly sampled in proportion to word frequency. The variances for words colored in orange are computed as discussed in Section \ref{sec:practical_estimation}.
  • Figure 4: Comparison between document bootstrap and GloVe-V standard errors for cosine similarity. The average standard error of the cosine similarity between 1,600 randomly sampled word pairs ($y$-axis) as a function of the frequency for the word pair ($x$-axis with word frequency ranges in brackets), computed with the document bootstrap and with GloVe-V via the delta method. The GloVe-V standard errors are more sensitive to word frequency and are more efficient to compute.
  • Figure 5: Nearest neighbors with uncertainty. Healthcare occupations ($y$-axis) ranked by their cosine similarity with "doctor" ($x$-axis), with the nearest neighbor ranking based on the point estimate above each point, and $95\%$ GloVe-V uncertainty intervals.
  • ...and 4 more figures
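
The draws in Figure 2 and the uncertainty intervals in Figure 5 suggest a resampling alternative to the delta method. The sketch below is a minimal illustration of that idea, not the authors' procedure: the function name, the number of draws, and the percentile-based interval are assumptions made here, and the two words are assumed to be independent with known mean vectors and covariance matrices.

```python
import numpy as np


def cosine_interval_by_sampling(u, v, sigma_u, sigma_v,
                                n_draws=1000, level=0.95, seed=0):
    """Percentile interval for cos(u, v) from multivariate normal draws.

    Assumption for this sketch: each word vector is drawn independently
    from a normal distribution centered on its point estimate.
    """
    rng = np.random.default_rng(seed)
    u_draws = rng.multivariate_normal(u, sigma_u, size=n_draws)
    v_draws = rng.multivariate_normal(v, sigma_v, size=n_draws)
    # Row-wise cosine similarity across paired draws.
    dots = np.einsum("ij,ij->i", u_draws, v_draws)
    norms = np.linalg.norm(u_draws, axis=1) * np.linalg.norm(v_draws, axis=1)
    sims = dots / norms
    alpha = 1.0 - level
    lower, upper = np.quantile(sims, [alpha / 2, 1.0 - alpha / 2])
    return float(lower), float(upper)


if __name__ == "__main__":
    # Toy vectors and covariances, purely illustrative.
    u = np.array([1.0, 0.0, 0.0])
    v = np.array([1.0, 1.0, 0.0])
    sigma = 1e-2 * np.eye(3)
    print(cosine_interval_by_sampling(u, v, sigma, sigma))
```

Applying the same draws to a ranking statistic, such as the position of each candidate word among the nearest neighbors of a target word, would give intervals for ranks instead of similarities.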