Retrieval Augmented Generation for Domain-specific Question Answering

Sanat Sharma; David Seunghyun Yoon; Franck Dernoncourt; Dewang Sultania; Karishma Bagga; Mengjiao Zhang; Trung Bui; Varun Kotte

TL;DR

Domain-specific QA for Adobe products is challenging for general LLMs because of terminology gaps and frequently changing product information. The paper presents a retrieval-augmented QA framework with a domain-tuned retriever trained on Adobe data and a retrieval-aware finetuning regime for an LLM, augmented by query disambiguation and privacy-preserving preprocessing. Finetuning uses triplets consisting of a grounding (positive) document $d^{+}$, a negative document $d^{-}$, and a question-answer pair $(q,a)$, with the model generating $y = \text{LLM}_\theta(d^{+}, d^{-}, q)$; this enables grounded, up-to-date answers with fewer hallucinations. Empirical results show improved retrieval quality (e.g., nDCG) and generation fidelity, supporting practical in-product Q&A that outperforms generic baselines on Adobe-related queries.
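As a rough illustration of the triplet setup $y = \text{LLM}_\theta(d^{+}, d^{-}, q)$, the sketch below assembles one supervised finetuning example from a question, its answer, a positive document, and a negative document. The `Triplet` class, the `build_example` helper, the prompt layout, and the shuffling of document order are all assumptions made for illustration; the summary above does not specify how the paper formats its training inputs.

```python
import random
from dataclasses import dataclass


@dataclass
class Triplet:
    """One training item: (q, a) plus a grounding and a distractor document."""
    question: str       # q
    answer: str         # a, the target the model should produce
    positive_doc: str   # d+, contains the information needed to answer q
    negative_doc: str   # d-, retrieved but not useful for answering q


def build_example(t: Triplet, rng: random.Random) -> dict:
    """Turn a triplet into a prompt/completion pair for finetuning.

    Assumption: document order is shuffled so the model cannot learn that
    the useful document always appears in the same position.
    """
    docs = [t.positive_doc, t.negative_doc]
    rng.shuffle(docs)
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(docs)
    )
    prompt = f"{context}\n\nQuestion: {t.question}\nAnswer:"
    return {"prompt": prompt, "completion": " " + t.answer}


if __name__ == "__main__":
    # Made-up placeholder texts, not taken from the paper's dataset.
    sample = Triplet(
        question="How do I export my project?",
        answer="Open the File menu and choose Export.",
        positive_doc="To export a project, open the File menu and choose Export.",
        negative_doc="Keyboard shortcuts can be customized in the settings panel.",
    )
    example = build_example(sample, random.Random(0))
    print(example["prompt"])
    print(example["completion"])
```

Including the negative document in the input is what makes the finetuning retrieval-aware in this sketch: the target answer depends only on $d^{+}$, so the model is pushed to ignore irrelevant retrieved text.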

Abstract

Question answering (QA) has become an important application in the development of large language models. General pre-trained large language models for question answering are not trained to properly understand the knowledge or terminology of a specific domain, such as finance, healthcare, education, or customer service for a product. To better cater to domain-specific understanding, we build an in-house question-answering system for Adobe products. We propose a novel framework to compile a large question-answer database and develop an approach for retrieval-aware finetuning of a large language model. We show that finetuning the retriever leads to major improvements in the final generation. Our overall approach reduces hallucinations during generation while keeping the latest retrieved information in context for grounding.

Paper Structure (24 sections, 1 equation, 3 figures, 7 tables)

Introduction
Related Work
LLM-based question answering systems
Retrieval augmented question answering systems
Method
Framework Overview
Retriever
Retriever Training Dataset
Retriever Model Training
Retrieval Index Creation and Database
Preprocessing for Building the Database
QA Generation Module
Named Entity Removal Module
Query Augmentation via Product Identification
LLM prompting
...and 9 more sections

Figures (3)

Figure 1: An overview of our proposed framework.
Figure 2: The training process for the retriever.
Figure 3: An overall architecture for indexing.
