Estimation of embedding vectors in high dimensions

Golara Ahmadi Azar, Melika Emami, Alyson Fletcher, Sundeep Rangan

TL;DR

The paper addresses how well embeddings can be learned for discrete data pairs under a probabilistic model with an unknown true embedding and biases. It introduces a Poisson observation model and extends low-rank approximate message passing (AMP) to a biased setting, using state evolution (SE) to predict estimation accuracy in the large-system limit. The key contributions are the biased low-rank AMP algorithm, a rigorous SE analysis, and insights into how sample efficiency and token frequency (e.g., Zipf-like marginals) affect embedding recovery, with quantitative predictions such as an inverse Fisher information parameter governing performance. The approach is validated on synthetic datasets and a real text dataset, where the SE predictions prove accurate, and it offers a principled lens for understanding embedding learning in high dimensions.

Abstract

Embeddings are a basic initial feature extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space where similar tokens are mapped to vectors that are close to one another by some metric in the embedding space. A basic question is how well such embeddings can be learned. To study this problem, we consider a simple probability model for discrete data in which there is some "true" but unknown embedding, and the correlation of random variables is related to the similarity of the embeddings. Under this model, it is shown that the embeddings can be learned by a variant of the low-rank approximate message passing (AMP) method. The AMP approach enables precise predictions of the accuracy of the estimation in certain high-dimensional limits. In particular, the methodology provides insight into the relations among key parameters such as the number of samples per value, the frequency of the terms, and the strength of the embedding correlation on the probability distribution. Our theoretical findings are validated by simulations on both synthetic data and real text data.
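
The kind of probability model described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's exact model: the log-linear rate (two bias terms plus a scaled inner product of the embeddings), the Gaussian draws, and all names (`simulate_counts`, `sample_poisson`, `scale`, `a`, `b`) are hypothetical choices made for illustration only.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample with Knuth's multiplication method.

    Suitable for small to moderate rates; exp(-rate) underflows for very large ones.
    """
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(m=6, n=8, d=2, scale=0.5, seed=0):
    """Simulate an m x n table of pair counts from hidden embeddings and biases."""
    rng = random.Random(seed)
    # Assumed "true" but unknown embeddings, one d-dimensional vector per token.
    U = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    # Assumed per-token biases, standing in for how frequent each token is.
    a = [rng.gauss(0, 0.3) for _ in range(m)]
    b = [rng.gauss(0, 0.3) for _ in range(n)]
    # Assumed log-linear rate: similar embeddings raise the expected pair count.
    rates = [
        [
            math.exp(
                a[i] + b[j]
                + scale * sum(U[i][k] * V[j][k] for k in range(d)) / math.sqrt(d)
            )
            for j in range(n)
        ]
        for i in range(m)
    ]
    counts = [[sample_poisson(rates[i][j], rng) for j in range(n)] for i in range(m)]
    return rates, counts
```

Calling `simulate_counts()` returns the hidden rate matrix together with one draw of the observed counts. In this sketch, `scale` plays the role of the strength of the embedding correlation: at `scale=0` the counts depend on the biases alone and carry no information about the embeddings.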

Paper Structure

This paper contains 29 sections, 8 theorems, 112 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Any fixed point of Algorithm alg:low-ramp is a local minimum of eq:LAB.

Figures (4)

  • Figure 1: Normalized loss (a) and MSE (b) vs iteration averaged over 20 instances, evaluated for an instance of the problem with $m=2000$, $n=3000$, $d=10$, and squared norm regularizers.
  • Figure 2: Normalized loss (a) and MSE (b) vs iteration averaged over 20 instances, evaluated for an instance of the problem with $m=2000$, $n=3000$, $d=10$, and L1 norm regularizers.
  • Figure 3: (a) Effect of individual biases on each element of $M$. As expected, we see an increasing trend of MSE with respect to $\Delta$. (b) The dominant singular values of $\widetilde{{Y}}$ are affected by $\Delta$. If $\Delta$ exceeds the critical value, the first $d$ singular values will not be distinguishable from the other singular values.
  • Figure 4: Loss function (a) and MSE (b) vs iteration when sampling from a real dataset.
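
The singular-value effect described for Figure 3(b) can be reproduced qualitatively in a toy setting. The sketch below is an assumption-laden illustration, not the paper's $\widetilde{Y}$: it uses a generic rank-one signal plus i.i.d. Gaussian noise of variance `delta`, and compares the top singular value with the approximate noise-only edge $\sqrt{\Delta}(\sqrt{m}+\sqrt{n})$. The function names and the matrix sizes are hypothetical.

```python
import math
import random


def top_singular_value(Y, iters=100, seed=0):
    """Estimate the largest singular value of Y by power iteration on Y^T Y."""
    rng = random.Random(seed)
    m, n = len(Y), len(Y[0])
    v = [rng.gauss(0, 1) for _ in range(n)]
    sigma = 0.0
    for _ in range(iters):
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
        w = [sum(Y[i][j] * v[j] for j in range(n)) for i in range(m)]  # w = Y v
        sigma = math.sqrt(sum(x * x for x in w))  # ||Y v|| for unit-norm v
        v = [sum(Y[i][j] * w[i] for i in range(m)) for j in range(n)]  # v = Y^T w
    return sigma


def spike_to_edge_ratio(delta, m=60, n=80, seed=0):
    """Ratio of the top singular value to the approximate noise-only edge."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(m)]
    v = [rng.gauss(0, 1) for _ in range(n)]
    # Assumed toy model: rank-one signal plus Gaussian noise with variance delta.
    Y = [
        [u[i] * v[j] / math.sqrt(n) + math.sqrt(delta) * rng.gauss(0, 1)
         for j in range(n)]
        for i in range(m)
    ]
    # Approximate largest singular value of an m x n pure-noise matrix.
    edge = math.sqrt(delta) * (math.sqrt(m) + math.sqrt(n))
    return top_singular_value(Y) / edge
```

With a small `delta` the ratio sits well above one, so the signal direction stands out from the noise; with a large `delta` it falls back to roughly one, meaning the top singular value is no longer distinguishable from the noise bulk. The sketch uses rank one for brevity, whereas the figure concerns the first $d$ singular values.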

Theorems & Definitions (10)

  • Lemma 1
  • Theorem 2
  • Definition 3
  • Definition 4
  • Lemma 5
  • Lemma 6
  • Lemma 7: Strong law of large numbers (SLLN) for triangular arrays, Theorem 2 of hu1989strong
  • Lemma 8
  • Lemma 9
  • Theorem 10