Gamma Mixture Modeling for Cosine Similarity in Small Language Models
Kevin Player
TL;DR
This work investigates how cosine similarities between sentence embeddings are distributed. It argues that a shifted and truncated gamma model, and often a gamma mixture, gives a compact, data-efficient description of the distribution on $[-1,1]$. It develops an EM-based fitting procedure for shifted gamma mixtures and motivates the mixture form with a hierarchical topic-clustering heuristic. Across multiple small language models and datasets, gamma models (sometimes as mixtures) consistently capture the empirical distributions, and a warm-start strategy substantially speeds up fitting. The practical result is a principled tool for tail-based significance modeling in semantic search and cross-document matching, which enables p-value-like assessments without large permutation datasets. The work also clarifies the limitations of von Mises–Fisher approaches for this task and offers a scalable, interpretable alternative grounded in topic hierarchies.
Abstract
We study the cosine similarities of sentence transformer embeddings and observe that their distributions are well modeled by gamma mixtures. For a fixed corpus, we measure the similarity between every document embedding and a reference query embedding. Empirically, we find that these distributions are often well captured by a gamma distribution shifted and truncated to $[-1,1]$, and in many cases by a gamma mixture. We propose a heuristic model in which a hierarchical clustering of topics naturally leads to a gamma-mixture structure in the similarity scores. Finally, we outline an expectation-maximization algorithm for fitting shifted gamma mixtures, which provides a practical tool for modeling similarity distributions.
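To make the fitting idea concrete, the following is a minimal sketch of EM for a shifted gamma mixture, not the procedure developed in this work. It rests on several assumptions that do not come from the text: the shift `loc` is fixed and known (here the lower bound -1), the truncation at 1 is ignored, the M-step uses the standard closed-form approximation to the gamma shape MLE rather than an exact solve, and the function names `gamma_logpdf` and `fit_shifted_gamma_mixture` are hypothetical.

```python
import numpy as np
from math import lgamma


def gamma_logpdf(y, k, theta):
    """Log density of Gamma(shape=k, scale=theta) evaluated at y > 0."""
    return (k - 1.0) * np.log(y) - y / theta - k * np.log(theta) - lgamma(k)


def fit_shifted_gamma_mixture(s, n_components=2, loc=-1.0, n_iter=200):
    """Fit a gamma mixture to y = s - loc with a fixed, assumed shift `loc`.

    Returns (weights, shapes, scales). Truncation at s = 1 is ignored.
    """
    # Shift the similarities onto the positive half-line.
    y = np.clip(np.asarray(s, dtype=float) - loc, 1e-9, None)
    log_y = np.log(y)
    n = y.size

    # Initialise responsibilities by splitting the sorted data into blocks.
    resp = np.zeros((n, n_components))
    for j, idx in enumerate(np.array_split(np.argsort(y), n_components)):
        resp[idx, j] = 1.0

    for _ in range(n_iter):
        # M-step: weighted sufficient statistics per component.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        mean = resp.T @ y / nk
        mean_log = resp.T @ log_y / nk
        # gap = log(mean) - mean(log y) is non-negative by Jensen's inequality.
        gap = np.maximum(np.log(mean) - mean_log, 1e-12)
        # Closed-form approximation to the gamma shape MLE.
        shapes = (3.0 - gap + np.sqrt((gap - 3.0) ** 2 + 24.0 * gap)) / (12.0 * gap)
        scales = mean / shapes

        # E-step: posterior responsibilities, computed in log space.
        log_p = np.stack(
            [
                np.log(weights[j]) + gamma_logpdf(y, shapes[j], scales[j])
                for j in range(n_components)
            ],
            axis=1,
        )
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

    return weights, shapes, scales
```

Because the scale is set to the weighted mean divided by the shape, each component's mean on the similarity axis is `shapes * scales + loc`. Treating the shift as a free parameter, or accounting for the truncation at 1, would require renormalizing the density and is left out of this sketch.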
