Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems

Laura Caspari; Kanishka Ghosh Dastidar; Saber Zerhoudi; Jelena Mitrovic; Michael Granitzer

TL;DR

This work asks how to assess embedding model similarity for retrieval augmented generation (RAG) beyond single benchmark scores. It applies two unsupervised evaluation axes to 19 models on five BEIR datasets: pairwise embedding similarity, measured with Centered Kernel Alignment (CKA), and retrieval similarity, measured with Jaccard and rank similarity over the top-k retrieved chunks. The findings show clear intra-family clustering with some inter-family patterns. Retrieval overlap at small k, however, is highly variable and often low, so embedding similarity does not predict retrieved content well. The study also identifies open-source alternatives that resemble proprietary models (e.g., Mistral for OpenAI models), but notes that retrieval similarity remains dependent on the dataset and on k, which underscores how hard it is to select embedding models for RAG in practice.

Abstract

The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Relying solely on benchmark performance scores only allows for a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding models within the context of RAG systems. Our assessment is two-fold: we use Centered Kernel Alignment to compare embeddings on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular Benchmarking Information Retrieval (BEIR) benchmark. Through our experiments we identify clusters of models corresponding to model families, but interestingly, also some inter-family clusters. Furthermore, our analysis of top-k retrieval similarity reveals high variance at low k values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to OpenAI models.
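
To make the pair-wise comparison concrete, the following is a minimal sketch of Centered Kernel Alignment between two sets of embeddings of the same texts. It assumes a linear kernel, which this page does not confirm as the paper's choice; the function name `linear_cka` and the row-per-text matrix layout are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def linear_cka(x, y):
    """Linear-kernel CKA between two embedding matrices.

    x has shape (n, d1) and y has shape (n, d2): row i of both matrices
    must embed the same text. The embedding dimensions may differ.
    Assumption: linear kernel (not confirmed by the summarized paper).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape[0] != y.shape[0]:
        raise ValueError("both models must embed the same n texts")

    # Center every feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)
```

Because the score is unchanged by rotating or uniformly rescaling either embedding space, it can compare models whose embeddings have different dimensions, which is what a cross-family comparison requires.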

Paper Structure (11 sections, 4 equations, 6 figures, 2 tables)

Motivation
Related Work
Methods
Pair-wise Embedding Similarity
Retrieval Similarity
Experimental Setup
Results
Intra- and Inter-Family Clusters
Open Source Alternatives to Proprietary Models
Discussion
Conclusion

Figures (6)

Figure 1: Mean CKA similarity across all five datasets. Models tend to be most similar to models belonging to their own family, though some interesting inter-family patterns are visible as well.
Figure 2: Rank similarity over all $k$ on NFCorpus, comparing gte-large to all other models. Scores are highest and vary most for small $k$, but then drop quickly before stabilizing for larger $k$.
Figure 3: Jaccard similarity over all $k$ on NFCorpus, comparing bge-large (a) and gte-large (b) to all other models. While bge-large shows high similarity to UAE-Large-v1 and mxbai-embed-large-v1, scores for gte-large are clustered much closer. Jaccard similarity seems to be most unstable for small values of $k$, which would commonly be chosen for retrieval tasks.
Figure 4: Jaccard (a) and rank similarity (b) for the top-10 retrieved text chunks averaged over 25 queries on NFCorpus. The clusters vary slightly depending on the measure, as do the scores. Models tend to be most similar to models from their own family. However, some inter-family clusters are visible as well.
Figure 5: Jaccard similarity for the top-10 retrieved text chunks averaged over 25 queries on SciFact (a) and ArguAna (b). The UAE and mxbai models show high levels of similarity along with bge-large. The remaining models tend to show the highest similarity within their own family with the exception of the bge/gte inter-family cluster.
...and 1 more figure
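
The Jaccard similarity at top-k that Figures 3 to 5 report can be sketched as follows. This is a hedged illustration only: the helper names `jaccard_at_k` and `mean_jaccard_at_k`, the chunk identifiers, and the convention of returning 0.0 for two empty result lists are assumptions, not details taken from the paper. Rank similarity is not sketched because its exact definition is not given on this page.

```python
def jaccard_at_k(ranked_a, ranked_b, k):
    """Jaccard overlap of the top-k chunk IDs retrieved by two models."""
    top_a = set(ranked_a[:k])
    top_b = set(ranked_b[:k])
    union = top_a | top_b
    if not union:
        # Assumed convention: two empty result lists share nothing.
        return 0.0
    return len(top_a & top_b) / len(union)


def mean_jaccard_at_k(results_a, results_b, k):
    """Average Jaccard at k over queries.

    results_a and results_b map a query ID to that model's ranked list
    of chunk IDs; only queries present in both mappings are scored.
    """
    shared = sorted(set(results_a) & set(results_b))
    if not shared:
        raise ValueError("no common queries to compare")
    scores = [jaccard_at_k(results_a[q], results_b[q], k) for q in shared]
    return sum(scores) / len(scores)


# Hypothetical ranked results for one query from two models.
model_a = {"q1": ["c1", "c2", "c3"]}
model_b = {"q1": ["c2", "c3", "c4"]}
print(mean_jaccard_at_k(model_a, model_b, k=3))  # 2 shared of 4 distinct: 0.5
```

With small k, a single differing chunk shifts the score substantially, which is consistent with the instability at low k noted in the Figure 3 caption.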
