Advancing Similarity Search with GenAI: A Retrieval Augmented Generation Approach
Jean Bertin
TL;DR
The paper tackles semantic similarity search by leveraging Retrieval Augmented Generation to derive similarity scores via a conversational chain between a system and user prompt. It demonstrates that, on the BIOSSES biomedical dataset, a moderate temperature ($T=0.5$) and about 20 in-prompt examples yield a peak Pearson correlation of $r=0.905$, surpassing several traditional baselines. The work highlights the potential and challenges of using generative models for semantic retrieval, including computational cost, output-format sensitivity, and reproducibility concerns, while proposing avenues for prompt optimization and model diversity. This suggests a promising research direction where generative reasoning complements classical retrieval for nuanced semantic matching, with practical implications for biomedical text processing.
Abstract
This article introduces an innovative Retrieval Augmented Generation approach to similarity search. The proposed method uses a generative model to capture nuanced semantic information and retrieve similarity scores based on advanced context understanding. The study focuses on the BIOSSES dataset containing 100 pairs of sentences extracted from the biomedical domain, and introduces similarity search correlation results that outperform those previously attained on this dataset. Through an in-depth analysis of the model sensitivity, the research identifies optimal conditions leading to the highest similarity search accuracy: the results reveals high Pearson correlation scores, reaching specifically 0.905 at a temperature of 0.5 and a sample size of 20 examples provided in the prompt. The findings underscore the potential of generative models for semantic information retrieval and emphasize a promising research direction to similarity search.
