
COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval

Kush Juvekar, Anupam Purwar

TL;DR

The paper tackles information retrieval challenges in Retrieval-Augmented Generation (RAG) for proprietary, knowledge-intensive data. It introduces COS-Mix, a hybrid retrieval strategy that fuses cosine similarity with cosine distance to better capture semantic relationships, supplemented by BM25 and dense-vector retrieval and a distance-based fallback. Empirical results on proprietary data show that the distance-based augmentation improves retrieval quality and answer generation for sparse information, while maintaining efficiency and reducing reliance on very large context windows. The work provides a practical enterprise-oriented retrieval framework that can adapt to task sparsity and demonstrates how distance-based ranking can complement traditional similarity-based methods to enhance RAG performance.

Abstract

This study proposes a novel hybrid retrieval strategy for Retrieval-Augmented Generation (RAG) that integrates cosine similarity and cosine distance measures to improve retrieval performance, particularly for sparse data. Cosine similarity is widely used to capture the similarity between vectors in high-dimensional spaces. However, it has been shown that this measure can yield arbitrary results in certain scenarios. To address this limitation, we incorporate cosine distance, which offers a complementary perspective by quantifying the dissimilarity between vectors. We evaluate our approach on proprietary data, unlike recent publications that have used open-source datasets. The proposed method demonstrates enhanced retrieval performance and provides a more comprehensive understanding of the semantic relationships between documents or items. This hybrid strategy offers a promising solution for efficiently and accurately retrieving relevant information in knowledge-intensive applications by combining BM25 (sparse) retrieval, vector (dense) retrieval, and cosine-distance-based retrieval.
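
To make the two measures concrete, the following is a minimal sketch of how cosine similarity and cosine distance relate, and of one way a distance-based fallback could be wired in when no chunk clears a similarity threshold. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the `sim_threshold` and `max_distance` values, and the toy two-dimensional vectors are all hypothetical, and the BM25 stage is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def retrieve(query, docs, sim_threshold=0.8, max_distance=0.6, k=2):
    """Return (strategy, names) for a dict mapping chunk name -> embedding.

    Assumed policy (hypothetical): first keep chunks whose similarity to the
    query is at least sim_threshold; if none qualify, fall back to chunks
    whose cosine distance is at most max_distance.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)

    hits = [name for name, sim in scored if sim >= sim_threshold][:k]
    if hits:
        return "similarity", hits

    # Fallback for sparse information: accept a looser, distance-based bound.
    fallback = [name for name, sim in scored if 1.0 - sim <= max_distance][:k]
    return "distance_fallback", fallback
```

With a query of `[1, 0]`, a chunk embedded at `[1, 0.1]` passes the similarity threshold directly, whereas a corpus containing only `[1, 1]` and `[0, 1]` triggers the fallback and returns the `[1, 1]` chunk, whose cosine distance is roughly 0.29. In practice the thresholds would need tuning on the target corpus.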

Paper Structure (11 sections, 1 figure, 1 table, 2 algorithms)

Figures (1)

  • Figure 1: COS-Mix LLM Interface