Enhanced document retrieval with topic embeddings

Kavsar Huseynova; Jafar Isbarov

TL;DR

The paper addresses retrieval accuracy bottlenecks in retrieval-augmented generation (RAG) when corpora contain multiple related topics. It proposes two topic-aware retrieval methods: (i) topic-enhanced document embeddings created by combining topic embeddings with original document embeddings via averaging or concatenation, and (ii) a two-stage retrieval that first selects a topic and then a document within that topic. Experiments on a dataset of Azerbaijani laws show that incorporating topic information improves topic separation, with the averaging method outperforming concatenation, though the two-stage approach could not be fully evaluated due to data constraints. The authors highlight the need for end-to-end evaluation datasets and multilingual validation to generalize the findings across languages and domains.
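The two ideas summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the L2 normalization, the cosine-similarity scoring, and the toy vectors are all assumptions made for illustration, and averaging additionally assumes that topic and document embeddings share the same dimensionality.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length (assumed preprocessing)."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def combine_average(doc_emb, topic_emb):
    """Topic-enhanced embedding by averaging; both inputs need the same dimension."""
    return l2_normalize((l2_normalize(doc_emb) + l2_normalize(topic_emb)) / 2.0)


def combine_concat(doc_emb, topic_emb):
    """Topic-enhanced embedding by concatenation; the output dimension is the sum."""
    parts = [l2_normalize(doc_emb), l2_normalize(topic_emb)]
    return l2_normalize(np.concatenate(parts, axis=-1))


def two_stage_retrieve(query_emb, topic_embs, doc_embs, doc_topics):
    """Pick the most similar topic, then the most similar document within it."""
    q = l2_normalize(query_emb)
    # Stage 1: cosine similarity between the query and every topic embedding.
    topic = int(np.argmax(l2_normalize(topic_embs) @ q))
    # Stage 2: restrict the search to documents labeled with that topic.
    candidates = np.flatnonzero(np.asarray(doc_topics) == topic)
    if candidates.size == 0:
        raise ValueError(f"no documents are assigned to topic {topic}")
    scores = l2_normalize(np.asarray(doc_embs)[candidates]) @ q
    return topic, int(candidates[np.argmax(scores)])


# Hypothetical toy data: two topics, three documents, 3-dimensional embeddings.
topic_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
doc_embs = np.array([[0.9, 0.1, 0.3], [0.8, 0.0, 0.6], [0.1, 0.9, 0.2]])
doc_topics = np.array([0, 0, 1])

averaged = combine_average(doc_embs[0], topic_embs[doc_topics[0]])
concatenated = combine_concat(doc_embs[0], topic_embs[doc_topics[0]])
topic_id, doc_id = two_stage_retrieve([1.0, 0.0, 1.0], topic_embs, doc_embs, doc_topics)
```

In this sketch, averaging keeps the original embedding dimension, while concatenation doubles it, so a query would have to be mapped into the same enlarged space before the two can be compared. The two-stage function depends on every document carrying a topic label, which is the kind of annotated data the summary notes was not fully available for evaluation.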

Abstract

Document retrieval systems have attracted renewed interest with the advent of retrieval-augmented generation (RAG). The RAG architecture offers a lower hallucination rate than LLM-only applications. However, the accuracy of the retrieval mechanism is known to be a bottleneck in the efficiency of these applications. Retrieval performance is notably subpar when the corpus contains documents from several different but related topics. We have devised a new vectorization method that takes the topic information of a document into account. The paper introduces this method for text vectorization and evaluates it in the context of RAG. Furthermore, we discuss the challenge of evaluating RAG systems as it pertains to the case at hand.
