Enhancing Knowledge Retrieval with Topic Modeling for Knowledge-Grounded Dialogue
Nhat Tran, Diane Litman
TL;DR
The paper tackles the knowledge retrieval bottleneck in knowledge-grounded dialogue by applying topic modeling to the knowledge base, which yields topic-guided clusters and a separate encoder per cluster. The resulting topic-aware DPR retriever (DPR-topic) is used within a RAG framework (RAG-topic) and in a context-aware variant (RAG-context-topic). Incorporating topic distributions into retrieval scores improves top-K recall on two datasets (MultiDoc2Dial and KILT-dialogue), and selecting the number of topics on the validation set yields robust gains, with optimal values typically around 4–5. The study also evaluates ChatGPT as a generator and finds that it is a better response generator when supplied with relevant retrieved knowledge. Overall, the work shows that topic-informed retrieval can improve both retrieval and generation in knowledge-grounded dialogue, and it highlights practical trade-offs and future directions for combining such methods with large language models.
Abstract
Knowledge retrieval is one of the major challenges in building a knowledge-grounded dialogue system. A common method is to use a neural retriever with a distributed approximate nearest-neighbor database to quickly find the relevant knowledge sentences. In this work, we propose an approach that utilizes topic modeling on the knowledge base to further improve retrieval accuracy and, as a result, response generation. Additionally, we experiment with a large language model, ChatGPT, to take advantage of the improved retrieval performance and further improve the generation results. Experimental results on two datasets show that our approach can increase retrieval and generation performance. The results also indicate that ChatGPT is a better response generator for knowledge-grounded dialogue when relevant knowledge is provided.
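
To make the idea of folding topic distributions into retrieval scores more concrete, the following is a minimal sketch rather than the paper's actual scoring function. It assumes each knowledge passage has been assigned to one topic cluster and that a topic model gives a probability distribution over topics for the query; the names `topic_weighted_scores`, `top_k`, `passage_topics`, and `query_topic_dist` are hypothetical, and the simple multiplicative weighting is an illustrative assumption. The per-cluster encoders are also abstracted away: the passage vectors are taken as already computed.

```python
import numpy as np

def topic_weighted_scores(query_vec, passage_vecs, passage_topics, query_topic_dist):
    """Scale each passage's dot-product similarity by the probability
    that the query's topic distribution assigns to the passage's topic.

    Illustrative assumption only; not the scoring rule from the paper.
    """
    sims = passage_vecs @ query_vec               # shape: (n_passages,)
    weights = query_topic_dist[passage_topics]    # shape: (n_passages,)
    return sims * weights

def top_k(scores, k):
    """Return indices of the k highest-scoring passages, best first."""
    return np.argsort(-scores, kind="stable")[:k]

# Toy data: 3 passages in a 2-dimensional embedding space, 2 topics.
query_vec = np.array([1.0, 0.0])
passage_vecs = np.array([[0.9, 0.0],
                         [1.0, 0.0],
                         [0.5, 0.5]])
passage_topics = np.array([0, 1, 0])        # topic cluster of each passage
query_topic_dist = np.array([0.8, 0.2])     # query is mostly about topic 0

scores = topic_weighted_scores(query_vec, passage_vecs,
                               passage_topics, query_topic_dist)
print(top_k(scores, 2))
```

In this toy example, plain dot-product similarity would rank passage 1 first, but because the query is mostly about topic 0, the topic weighting moves passages 0 and 2 ahead of it.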
