Sparse Overcomplete Word Vector Representations

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, Noah Smith

TL;DR

The paper addresses the interpretability gap between dense word vectors and lexical semantic theories by introducing sparse overcomplete transformations that produce lengthy, sparse (and optionally binary) representations learned from raw corpora. It presents two methods—sparse coding (A) and nonnegative sparse coding with binarization (B)—and demonstrates via extensive benchmarks and a word intrusion study that these transformed vectors generally outperform the original vectors and are more interpretable. The approach relies on AdaGrad optimization, nonnegativity constraints, and a careful hyperparameter grid search to balance sparsity and performance. Overall, the work offers a principled pathway to obtain interpretable, task-robust word representations that can serve as discrete-style features for NLP models, with code released for public use.

Abstract

Current distributed representations of words show little resemblance to theories of lexical semantics. The former are dense and uninterpretable, the latter largely based on familiar, discrete classes (e.g., supersenses) and relations (e.g., synonymy and hypernymy). We propose methods that transform word vectors into sparse (and optionally binary) vectors. The resulting representations are more similar to the interpretable features typically used in NLP, though they are discovered automatically from raw corpora. Because the vectors are highly sparse, they are computationally easy to work with. Most importantly, we find that they outperform the original vectors on benchmark tasks.

Paper Structure

This paper contains 27 sections, 9 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Methods for obtaining sparse overcomplete vectors (top, method A, §\ref{sec:sparse-coding}) and sparse, binary overcomplete word vectors (bottom, method B, §\ref{sec:nonneg} and §\ref{sec:binary}). Observed dense vectors of length $L$ (left) are converted to sparse non-negative vectors (center) of length $K$, which are then projected into the binary vector space (right), where $L \ll K$. $\mathbf{X}$ is the dense word vector matrix, $\mathbf{A}$ is the sparse one, and $\mathbf{B}$ is the binary one. Color intensity indicates the magnitude of a value; negative is red, positive is blue, and zero is white.
  • Figure 2: Average performance across all tasks for sparse overcomplete vectors ($\mathbf{A}$) produced from initial GloVe vectors, as a function of the ratio of $K$ to $L$.
  • Figure 3: Visualization of sparsified GC vectors. Negative values are red, positive values are blue, zeroes are white.
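
The pipeline in Figure 1 (method B) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary says they optimize with AdaGrad and learn the transformation from data, whereas this sketch keeps a random dictionary fixed and uses plain proximal gradient descent instead. The dictionary name `D`, the objective $\tfrac{1}{2}\lVert \mathbf{X} - \mathbf{D}\mathbf{A} \rVert_F^2 + \lambda \lVert \mathbf{A} \rVert_1$ with $\mathbf{A} \ge 0$, the toy sizes, and the rule "nonzero becomes 1" for binarization are all assumptions made for illustration.

```python
import numpy as np


def nonneg_sparse_codes(X, D, lam=0.1, steps=200):
    """Approximately minimise 0.5*||X - D A||_F^2 + lam*||A||_1 subject to A >= 0.

    D is held fixed (an assumption of this sketch); only the codes A are updated.
    """
    # Step size 1/Lipschitz constant, where the constant is the squared
    # largest singular value of D.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(steps):
        grad = D.T @ (D @ A - X)  # gradient of the reconstruction term
        # Proximal step for the L1 penalty combined with the nonnegativity
        # constraint: shift by step*lam, then clip at zero.
        A = np.maximum(A - step * (grad + lam), 0.0)
    return A


rng = np.random.default_rng(0)
L, K, V = 5, 15, 8  # dense length, overcomplete length (L < K), toy vocabulary size

X = rng.normal(size=(L, V))                    # stand-in for dense word vectors (one per column)
D = rng.normal(size=(L, K))                    # hypothetical fixed dictionary
D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm dictionary columns

A = nonneg_sparse_codes(X, D)                  # sparse, nonnegative, length-K codes
B = (A > 0).astype(np.int8)                    # assumed binarization: nonzero -> 1

sparsity = 1.0 - np.count_nonzero(A) / A.size  # fraction of exact zeros in A
```

Here each column of `A` plays the role of one word's overcomplete vector and `B` is its binary counterpart; raising `lam` trades reconstruction quality for more zeros, which is the balance the hyperparameter search mentioned in the TL;DR is meant to tune.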