Enhanced document retrieval with topic embeddings

Kavsar Huseynova, Jafar Isbarov

TL;DR

The paper addresses retrieval accuracy bottlenecks in retrieval-augmented generation (RAG) when corpora contain multiple related topics. It proposes two topic-aware retrieval methods: (i) topic-enhanced document embeddings created by combining topic embeddings with original document embeddings via averaging or concatenation, and (ii) a two-stage retrieval that first selects a topic and then a document within that topic. Experiments on a dataset of Azerbaijani laws show that incorporating topic information improves topic separation, with the averaging method outperforming concatenation, though the two-stage approach could not be fully evaluated due to data constraints. The authors highlight the need for end-to-end evaluation datasets and multilingual validation to generalize the findings across languages and domains.
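The two ideas summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional vectors, and the use of cosine similarity are all assumptions made for illustration, and the summary does not say how queries are embedded when concatenation is used.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity (assumed metric; the summary does not name one)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_embedding(doc: Sequence[float], topic: Sequence[float]) -> Vector:
    """Topic-enhanced embedding by element-wise averaging (same dimension)."""
    if len(doc) != len(topic):
        raise ValueError("averaging requires embeddings of equal dimension")
    return [(d + t) / 2.0 for d, t in zip(doc, topic)]


def concat_embedding(doc: Sequence[float], topic: Sequence[float]) -> Vector:
    """Topic-enhanced embedding by concatenation (dimension is the sum)."""
    return list(doc) + list(topic)


def two_stage_retrieve(
    query: Sequence[float],
    topic_embs: Dict[str, Vector],
    docs: Dict[str, Tuple[str, Vector]],
) -> Tuple[str, str]:
    """Stage 1: pick the closest topic. Stage 2: pick the closest document in it."""
    best_topic = max(topic_embs, key=lambda t: cosine(query, topic_embs[t]))
    candidates = {
        doc_id: emb for doc_id, (topic, emb) in docs.items() if topic == best_topic
    }
    if not candidates:
        raise ValueError(f"no documents for topic {best_topic!r}")
    best_doc = max(candidates, key=lambda d: cosine(query, candidates[d]))
    return best_topic, best_doc


# Hypothetical toy data: two topics and three documents in a 2-D space.
topic_embs = {"tax": [1.0, 0.0], "labor": [0.0, 1.0]}
docs = {
    "tax_1": ("tax", [0.9, 0.1]),
    "tax_2": ("tax", [0.6, 0.4]),
    "labor_1": ("labor", [0.1, 0.9]),
}

averaged = average_embedding([1.0, 3.0], [3.0, 1.0])
concatenated = concat_embedding([1.0, 3.0], [3.0, 1.0])
result = two_stage_retrieve([0.8, 0.2], topic_embs, docs)
```

Averaging keeps the original dimensionality, so existing query embeddings can be compared against the enhanced vectors directly. Concatenation doubles the dimension, so a query would need a matching topic component before it could be compared; how that is handled is left open here.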

Abstract

Document retrieval systems have attracted renewed interest with the advent of retrieval-augmented generation (RAG). The RAG architecture offers a lower hallucination rate than LLM-only applications. However, the accuracy of the retrieval mechanism is known to be a bottleneck for these applications. Retrieval performance is particularly poor when the corpus contains multiple documents from several different but related topics. We have devised a new vectorization method that takes the topic information of each document into account. The paper introduces this new method for text vectorization and evaluates it in the context of RAG. Furthermore, we discuss the challenge of evaluating RAG systems, which pertains to the case at hand.

Paper Structure

This paper contains 11 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Process for generating topic embeddings from original documents.
  • Figure 2: 2D visualization of topics with (a) original embeddings, (b) averaged embeddings, (c) appended embeddings.