Global dense vector representations for words or items using shared parameter alternating Tweedie model

Taejoon Kim; Haiyan Wang

TL;DR

The paper develops the Shared-parameter Alternating Tweedie (SA-Tweedie) model to learn global dense representations from high-dimensional, sparse co-occurrence data by modeling counts with Tweedie distributions in a shared-parameter, alternating regression framework. It derives IRLS-style Fisher-scoring updates for alternating blocks of row and column parameters, discusses learning-rate stabilization, and addresses the practical difficulty of estimating the Tweedie power $p$ and dispersion $\phi$ via a piecewise mean–variance approach. In small-scale simulations and large-scale scalability experiments (e.g., Wikipedia-scale co-occurrence and CoNLL-2003 NER), the method performs competitively with substantially smaller models than transformer-based embeddings. The paper also describes practical strategies for efficient data handling, including CPU-GPU parallelism and database-backed data retrieval. The results suggest that SA-Tweedie embeddings offer a principled probabilistic alternative for generating robust global word representations, applicable to NLP tasks and recommender-style problems, and point to future work on adaptive fine-tuning and multimodal integration.
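The mean–variance idea mentioned above can be illustrated with a small sketch. A Tweedie variable satisfies $\mathrm{Var}(Y) = \phi\,\mu^{p}$, so $\log \mathrm{Var}(Y) = \log\phi + p\,\log\mu$: regressing log sample variance on log sample mean gives a slope that estimates $p$ and an intercept that estimates $\log\phi$. The code below fits that line separately within intervals of the log mean. The function names, the interval boundaries (`edges`), and the use of per-row sample moments are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def row_log_moments(counts):
    """Log sample mean and log sample variance of each row (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    keep = (mean > 0) & (var > 0)          # logs are undefined otherwise
    return np.log(mean[keep]), np.log(var[keep])

def fit_piecewise(log_mean, log_var, edges):
    """Fit log_var = log(phi) + p * log_mean within each interval of log_mean.

    `edges` are interval boundaries on the log-mean scale; they are an
    assumption of this sketch and would have to be chosen from the data.
    Returns a list of (lower, upper, p_hat, phi_hat) tuples.
    """
    log_mean = np.asarray(log_mean, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    fits = []
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (log_mean >= lower) & (log_mean < upper)
        if in_bin.sum() < 2:               # need at least two points for a line
            continue
        slope, intercept = np.polyfit(log_mean[in_bin], log_var[in_bin], 1)
        fits.append((lower, upper, slope, float(np.exp(intercept))))
    return fits
```

In use, `row_log_moments` would be applied to the co-occurrence matrix (or a sample of its rows) and the result passed to `fit_piecewise`; each interval then carries its own $(\hat p, \hat\phi)$ pair.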

Abstract

In this article, we present a model for analyzing co-occurrence count data arising in practice, such as user-item or item-item data from online shopping platforms and co-occurring word-word pairs in text sequences. Such data contain important information for developing recommender systems or for studying the relevance of items or words from non-numerical sources. Unlike traditional regression models, there are no observed covariates. Additionally, the co-occurrence matrix is typically so high-dimensional that it does not fit into a computer's memory for modeling. We extract numerical data by defining co-occurrence windows and using weighted counts on a continuous scale. Positive probability mass is allowed for zero observations. We present the Shared parameter Alternating Tweedie (SA-Tweedie) model and an algorithm to estimate its parameters. We introduce a learning rate adjustment, used along with the Fisher scoring method in the inner loop, to help the algorithm stay on track in the optimizing direction. Gradient descent with the Adam update was also considered as an alternative estimation method. Simulation studies and an application showed that our algorithm with Fisher scoring and learning rate adjustment outperforms the other two methods. A pseudo-likelihood approach with alternating parameter updates was also studied. Numerical studies showed that the pseudo-likelihood approach is not suitable for our shared parameter alternating regression models with unobserved covariates.
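To make the inner-loop idea concrete, the following is a minimal sketch of Fisher scoring for one parameter block of a Tweedie regression with a log link, assuming $1 < p < 2$. With the other block held fixed, its current values play the role of the design matrix `X`, and `y` is one row of the co-occurrence matrix. The step-halving rule stands in for the learning rate adjustment, whose exact form is not given in this summary. All function names, the ridge term, and the clipping of the linear predictor are assumptions of the sketch.

```python
import numpy as np

def tweedie_mean(X, beta):
    # Log link; the linear predictor is clipped only to avoid overflow in exp.
    return np.exp(np.clip(X @ beta, -30.0, 30.0))

def tweedie_loss(y, mu, p):
    # Negative Tweedie quasi-log-likelihood for 1 < p < 2, dropping terms
    # that do not depend on mu and the common factor 1/phi.
    return float(np.sum(-y * mu ** (1.0 - p) / (1.0 - p)
                        + mu ** (2.0 - p) / (2.0 - p)))

def fisher_scoring_block(y, X, beta, p=1.5, n_epochs=10, lr=1.0, ridge=1e-8):
    """Update one parameter block by Fisher scoring with step halving.

    Step halving is an assumed stand-in for the learning rate adjustment:
    a step is accepted only if it does not increase the loss.
    """
    beta = np.asarray(beta, dtype=float).copy()
    loss = tweedie_loss(y, tweedie_mean(X, beta), p)
    for _ in range(n_epochs):
        mu = tweedie_mean(X, beta)
        # Score and expected information under the log link (phi cancels).
        score = X.T @ ((y - mu) * mu ** (1.0 - p))
        info = (X * (mu ** (2.0 - p))[:, None]).T @ X
        info += ridge * np.eye(X.shape[1])   # small ridge for numerical stability
        direction = np.linalg.solve(info, score)
        step = lr
        while step > 1e-6:
            candidate = beta + step * direction
            cand_loss = tweedie_loss(y, tweedie_mean(X, candidate), p)
            if cand_loss <= loss:
                beta, loss = candidate, cand_loss
                break
            step *= 0.5                      # shrink the step and retry
    return beta, loss
```

In an alternating scheme, this update would be applied to each row with the column parameters fixed, and then the roles would be swapped. How the parameters are shared between the two blocks is specific to the paper and is not reproduced here.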

Paper Structure (9 sections, 35 equations, 12 figures, 3 tables, 1 algorithm)

Introduction
The probability distribution for the proposed SA-Tweedie model
MLE for alternating Tweedie regression
Impact of the parameters $p$ and $\phi$
A small simulation study
Scalability & application to Named Entity Recognition
Scalability to data with large vocabulary size & training corpus
Application to NER task on CoNLL-2003 data
Summary, discussion, and future research

Figures (12)

Figure 1: Illustration of model input and desired output. Left panel: model input, the natural log of the (weighted occurrence count + 1) matrix for the top 300 words from Reuter Business news data. Right panel: shared parameter Tweedie modeling process and output.
Figure 2: Computed log(loss) and log(overall loss) from a simulated dataset using Fisher scoring with or without learning rate adjustment, and the gradient descent algorithm with the Adam method for parameter updates. The left panel shows how the loss changes over 10 epochs for one row of the parameter update. As the epoch number grows, the loss generally decreases, but the loss under Adam is higher and decreases more slowly than under the other two updates. The right panel shows the overall loss versus the number of iterations on the $\log$ scale. All losses decrease as the iteration number increases, but the Adam update yields a higher overall loss.
Figure 3: Relationship between the log of the sample mean and the log of the sample variance from Wikipedia data with a 50K vocabulary size. The three lines in each interval are the fitted linear regression line and the upper and lower bounds with the same slope.
Figure 4: The loss reduction was compared within epochs among three different updates: the alternating Tweedie regression algorithm with and without learning rate adjustment, and the Adam update. The results are from the first iteration and the first row of the data matrix in our Algorithm 1.
Figure 5: The overall loss over iterations for three different update methods: Fisher scoring with or without learning rate adjustment, and the Adam update. The Fisher scoring type updates, with or without learning rate adjustment, start with a lower overall loss than the Adam update and reduce the overall loss faster as the iteration number increases.
...and 7 more figures
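The model input described in Figure 1, a matrix of log(weighted co-occurrence count + 1), can be sketched as follows. The weighting scheme is an assumption: each pair of tokens within a window contributes 1/distance, a common choice for weighted counts on a continuous scale, which may differ from the weighting used in the paper. The function names are hypothetical, and a sparse dictionary is used because the full matrix is typically too large to hold densely.

```python
from collections import defaultdict
import math

def weighted_cooccurrence(tokens, window=2):
    """Symmetric weighted co-occurrence counts stored sparsely.

    Assumption of this sketch: a pair at distance d contributes 1/d.
    A word co-occurring with itself receives the contribution twice,
    because both orderings map to the same key.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            weight = 1.0 / d
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return counts

def log1p_counts(counts):
    # log(count + 1), the transform named in the Figure 1 caption.
    return {pair: math.log1p(value) for pair, value in counts.items()}
```

Pairs that never co-occur are simply absent from the dictionary and correspond to the zero observations for which the Tweedie model allows positive probability mass.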
