Advancing Similarity Search with GenAI: A Retrieval Augmented Generation Approach

Jean Bertin

Advancing Similarity Search with GenAI: A Retrieval Augmented Generation Approach

Jean Bertin

TL;DR

The paper tackles semantic similarity search by leveraging Retrieval Augmented Generation to derive similarity scores via a conversational chain between a system and user prompt. It demonstrates that, on the BIOSSES biomedical dataset, a moderate temperature ($T=0.5$) and about 20 in-prompt examples yield a peak Pearson correlation of $r=0.905$, surpassing several traditional baselines. The work highlights the potential and challenges of using generative models for semantic retrieval, including computational cost, output-format sensitivity, and reproducibility concerns, while proposing avenues for prompt optimization and model diversity. This suggests a promising research direction where generative reasoning complements classical retrieval for nuanced semantic matching, with practical implications for biomedical text processing.

Abstract

This article introduces an innovative Retrieval Augmented Generation approach to similarity search. The proposed method uses a generative model to capture nuanced semantic information and retrieve similarity scores based on advanced context understanding. The study focuses on the BIOSSES dataset containing 100 pairs of sentences extracted from the biomedical domain, and introduces similarity search correlation results that outperform those previously attained on this dataset. Through an in-depth analysis of the model sensitivity, the research identifies optimal conditions leading to the highest similarity search accuracy: the results reveals high Pearson correlation scores, reaching specifically 0.905 at a temperature of 0.5 and a sample size of 20 examples provided in the prompt. The findings underscore the potential of generative models for semantic information retrieval and emphasize a promising research direction to similarity search.

Advancing Similarity Search with GenAI: A Retrieval Augmented Generation Approach

TL;DR

) and about 20 in-prompt examples yield a peak Pearson correlation of

, surpassing several traditional baselines. The work highlights the potential and challenges of using generative models for semantic retrieval, including computational cost, output-format sensitivity, and reproducibility concerns, while proposing avenues for prompt optimization and model diversity. This suggests a promising research direction where generative reasoning complements classical retrieval for nuanced semantic matching, with practical implications for biomedical text processing.

Abstract

Paper Structure (15 sections, 1 equation, 4 figures, 1 table)

This paper contains 15 sections, 1 equation, 4 figures, 1 table.

Introduction
METHOD
Presentation of the BIOSSES dataset
Calculation metric for similarity
Prompt engineering for similarity search case
Implementation of the conversational chain
Iterate on test dataset
RESULTS
Sensibility to temperature parameter
Influence of the number of examples given to the prompt
Cross-factor analysis
DISCUSSION
Limitations and constraints
Areas for improvement
CONCLUSION

Figures (4)

Figure 1: Conversational chain iteration on pairs of sentences
Figure 2: Evolution of similarity results with temperature
Figure 3: Evolution of similarity results with the number of examples given to the prompt
Figure 4: Pearson correlation as function of temperature and sample size

Advancing Similarity Search with GenAI: A Retrieval Augmented Generation Approach

TL;DR

Abstract

Advancing Similarity Search with GenAI: A Retrieval Augmented Generation Approach

Authors

TL;DR

Abstract

Table of Contents

Figures (4)