Retrieval Augmented Generation for Domain-specific Question Answering

Sanat Sharma, David Seunghyun Yoon, Franck Dernoncourt, Dewang Sultania, Karishma Bagga, Mengjiao Zhang, Trung Bui, Varun Kotte

TL;DR

Domain-specific QA for Adobe products is challenging for general LLMs because of terminology gaps and frequently changing product information. The paper presents a retrieval-augmented QA framework that combines a domain-tuned retriever trained on Adobe data with a retrieval-aware finetuning regime for an LLM, augmented by query disambiguation and privacy-preserving preprocessing. Finetuning uses triplets of a grounding document $d^{+}$, a negative document $d^{-}$, and a question-answer pair $(q, a)$, with the model output $y = \text{LLM}_\theta(d^{+}, d^{-}, q)$, so that answers stay grounded and up to date with fewer hallucinations. Empirical results show improved retrieval quality (e.g., nDCG) and generation fidelity, making in-product Q&A practical and outperforming generic baselines on Adobe-related queries.
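To make the triplet formulation concrete, the following is a minimal sketch of how one finetuning example could be assembled from $d^{+}$, $d^{-}$, and $(q, a)$. The prompt template, the `Triplet` and `build_example` names, the document shuffling, and the sample strings are all illustrative assumptions; the summary above does not specify how the authors format their inputs.

```python
import random
from dataclasses import dataclass


@dataclass
class Triplet:
    """One hypothetical training record: (q, a) plus d+ and d-."""
    question: str
    answer: str
    positive_doc: str  # d+: document that supports the answer
    negative_doc: str  # d-: distractor document


def build_example(t: Triplet, rng: random.Random) -> dict:
    """Build a prompt/target pair so the model learns y = LLM(d+, d-, q).

    Assumption: the two documents are shuffled so the model cannot rely on
    position to tell the grounding document from the distractor.
    """
    docs = [t.positive_doc, t.negative_doc]
    rng.shuffle(docs)
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(docs)
    )
    prompt = f"{context}\n\nQuestion: {t.question}\nAnswer:"
    return {"prompt": prompt, "target": " " + t.answer}


# Made-up sample text, used only to show the shape of the data.
sample = Triplet(
    question="How do I crop an image?",
    answer="Select the Crop tool, drag the handles, and confirm.",
    positive_doc="Cropping: select the Crop tool, drag the handles, then confirm.",
    negative_doc="Layers let you stack and reorder elements of a design.",
)
example = build_example(sample, random.Random(0))
```

The target holds only the reference answer $a$, so the training loss would be computed on the answer tokens while both documents remain in the context.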

Abstract

Question answering (QA) has become an important application in the development of large language models. General pre-trained large language models for question answering are not trained to properly understand the knowledge or terminology of a specific domain, such as finance, healthcare, education, or customer service for a product. To better cater to domain-specific understanding, we build an in-house question-answering system for Adobe products. We propose a novel framework to compile a large question-answer database and develop an approach for retrieval-aware finetuning of a large language model. We show that fine-tuning the retriever leads to major improvements in the final generation. Our overall approach reduces hallucinations during generation while keeping the latest retrieved information in context for grounding.
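Retriever quality in this kind of system is commonly reported with nDCG. As a rough illustration, the sketch below computes nDCG@k for one ranked list of graded relevance labels. It assumes the linear-gain form of DCG with a log2 rank discount; the exact variant and cutoff used in the paper are not stated here, and the function names are hypothetical.

```python
import math


def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top result is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances: list[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the relevance labels of the retrieved documents in ranked order, so a perfectly ordered list scores 1.0 and a list with no relevant documents scores 0.0.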

Paper Structure

This paper contains 24 sections, 1 equation, 3 figures, 7 tables.

Figures (3)

  • Figure 1: An overview of our proposed framework.
  • Figure 2: The training process for the retriever.
  • Figure 3: An overall architecture for indexing.