Gamma Mixture Modeling for Cosine Similarity in Small Language Models

Kevin Player

TL;DR

The work investigates how cosine similarities between sentence embeddings are distributed, arguing that a shifted and truncated gamma model (and often a gamma mixture) provides a compact, data-efficient description of the distribution on $[-1,1]$. It develops an EM-based fitting procedure for shifted gamma mixtures and justifies the mixture form with a hierarchical topic-clustering intuition. Across multiple small language models and datasets, gamma models, sometimes as mixtures, consistently capture the empirical distributions, and a warm-start strategy dramatically speeds up fitting. The practical impact is a principled tool for tail-based significance modeling in semantic search and cross-document matching, enabling reliable p-value-like assessments without large permutation datasets. The work also clarifies limitations of von Mises–Fisher approaches for this task and offers a scalable, interpretable alternative grounded in topic hierarchies.

Abstract

We study the cosine similarities of sentence transformer embeddings and observe that they are well modeled by gamma mixtures. From a fixed corpus, we measure similarities between all document embeddings and a reference query embedding. Empirically, we find that these distributions are often well captured by a gamma distribution shifted and truncated to $[-1,1]$, and, in many cases, by a gamma mixture. We propose a heuristic model in which a hierarchical clustering of topics naturally leads to a gamma-mixture structure in the similarity scores. Finally, we outline an expectation-maximization algorithm for fitting shifted gamma mixtures, which provides a practical tool for modeling similarity distributions.
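As a rough illustration of the pipeline the abstract describes, the sketch below computes cosine similarities between one query and a set of documents, fits a single shifted gamma, and evaluates an upper-tail p-value under truncation at 1. It is not the paper's algorithm: it fits one gamma rather than a mixture, uses SciPy's maximum-likelihood fit instead of the EM procedure, and fixes the shift at $-1$, which is an assumption made here. The function names and the synthetic Gaussian "embeddings" are hypothetical stand-ins.

```python
import numpy as np
from scipy import stats


def cosine_similarities(query, docs):
    """Cosine similarity between one query vector and each row of docs."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q


def fit_shifted_gamma(sims, shift=-1.0):
    """Maximum-likelihood gamma shape and scale with the location fixed at `shift`.

    Fixing the shift at -1 is an assumption of this sketch, not taken from the paper.
    """
    shape, loc, scale = stats.gamma.fit(sims, floc=shift)
    return shape, loc, scale


def tail_p_value(s, shape, loc, scale, upper=1.0):
    """P(X >= s | X <= upper) under the fitted gamma truncated at `upper`."""
    dist = stats.gamma(shape, loc=loc, scale=scale)
    mass = dist.cdf(upper)  # probability mass kept after truncation
    return float(np.clip((mass - dist.cdf(s)) / mass, 0.0, 1.0))


# Synthetic stand-ins for real sentence embeddings (assumption for the demo).
rng = np.random.default_rng(0)
docs = rng.normal(size=(5000, 32))
query = rng.normal(size=32)

sims = cosine_similarities(query, docs)
shape, loc, scale = fit_shifted_gamma(sims)

# p-value-like score for a similarity at the empirical 99th percentile.
p = tail_p_value(np.quantile(sims, 0.99), shape, loc, scale)
print(f"shape={shape:.2f} scale={scale:.4f} p={p:.4f}")
```

With real data, `docs` and `query` would be replaced by embeddings from a sentence transformer. A small tail probability then flags a similarity that is unusually high relative to the rest of the corpus for that query.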

Paper Structure

This paper contains 16 sections, 19 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: The most significant matches of three sentence fragments in a summary (queries) with fragments in the document. Example from the xsum dataset (Narayan2018DontGM), using p-values modeled from $\texttt{all-MiniLM-L6-v2}$ (Wang2020MiniLMDS).
  • Figure 2: Example distribution $D_q$ for the abstract ‘Using Genetic Algorithms for Texts Classification Problems’ (arXiv dataset). A shifted gamma distribution provides a good fit.
  • Figure 3: An example of $D_q$, where $q$ is the abstract "Why Global Performance is a Poor Metric for Verifying Convergence of Multi-agent Learning" (arXiv dataset). $D_q$ is fit well by a mixture of two gamma distributions.
  • Figure 4: Typical vMF distribution of cosine similarity ($d = 10$ and $\kappa = 10$).
  • Figure 5: Histogram and fitted gamma for $\eta = 0.95$ in Algorithm 1. The histogram is fit well by a single gamma.
  • ...and 3 more figures