Golden-Retriever: High-Fidelity Agentic Retrieval Augmented Generation for Industrial Knowledge Base

Zhiyu An, Xianzhong Ding, Yen-Chun Fu, Cheng-Chung Chu, Yan Li, Wan Du

TL;DR

Golden-Retriever tackles the challenge of domain-specific jargon in industrial knowledge bases by adding a reflection-based augmentation step before retrieval that clarifies jargon and context. The method combines offline, OCR-based document augmentation with an online, LLM-driven step that identifies jargon and context, then augments the query using a jargon dictionary and structured templates before passing it to RAG. Empirical results across three open-source LLM backbones on a domain-specific QA dataset show significant accuracy gains over both a vanilla LLM and vanilla RAG, along with robust performance on an abbreviation identification task. The approach integrates knowledge without fine-tuning, scales to large knowledge bases, and makes retrieval in industrial settings more accurate, reducing misinterpretation and improving knowledge access for engineers and knowledge workers.

Abstract

This paper introduces Golden-Retriever, designed to efficiently navigate vast industrial knowledge bases, overcoming challenges in traditional LLM fine-tuning and RAG frameworks with domain-specific jargon and context interpretation. Golden-Retriever incorporates a reflection-based question augmentation step before document retrieval, which involves identifying jargon, clarifying its meaning based on context, and augmenting the question accordingly. Specifically, our method extracts and lists all jargon and abbreviations in the input question, determines the context against a pre-defined list, and queries a jargon dictionary for extended definitions and descriptions. This comprehensive augmentation ensures the RAG framework retrieves the most relevant documents by providing clear context and resolving ambiguities, significantly improving retrieval accuracy. Evaluations using three open-source LLMs on a domain-specific question-answer dataset demonstrate Golden-Retriever's superior performance, providing a robust solution for efficiently integrating and querying industrial knowledge bases.
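
The augmentation steps described in the abstract (list the jargon, pick a context from a pre-defined list, look up definitions, then rewrite the question) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the paper uses an LLM for the jargon and context steps, whereas the sketch substitutes simple deterministic stand-ins so that it runs on its own. The dictionary entries, context names, keywords, template layout, and all function names are assumptions made for this example.

```python
import re
from typing import Callable, Dict, List

# Hypothetical jargon dictionary; entries are invented for illustration only.
JARGON_DICT: Dict[str, str] = {
    "RAG": "Retrieval Augmented Generation",
    "OCR": "Optical Character Recognition",
}

# Hypothetical pre-defined context list, each with illustrative keywords.
CONTEXTS: Dict[str, List[str]] = {
    "document processing": ["scan", "pdf", "ocr"],
    "question answering": ["answer", "retrieve", "rag"],
}


def identify_jargon(question: str) -> List[str]:
    """Stand-in for the LLM jargon step: all-caps tokens, in order, no repeats."""
    found: List[str] = []
    for token in re.findall(r"\b[A-Z]{2,}\b", question):
        if token not in found:
            found.append(token)
    return found


def identify_context(question: str) -> str:
    """Stand-in for the LLM context step: context with the most keyword hits."""
    lowered = question.lower()
    scores = {
        name: sum(keyword in lowered for keyword in keywords)
        for name, keywords in CONTEXTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def augment_question(
    question: str,
    find_jargon: Callable[[str], List[str]] = identify_jargon,
    find_context: Callable[[str], str] = identify_context,
    dictionary: Dict[str, str] = JARGON_DICT,
) -> str:
    """Fill an assumed template with context, definitions, and the question."""
    lines = [f"Context: {find_context(question)}"]
    for term in find_jargon(question):
        # Terms missing from the dictionary are flagged rather than guessed.
        lines.append(f"{term}: {dictionary.get(term, 'no definition found')}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(augment_question("How does OCR quality affect RAG when we scan a PDF?"))
```

The two identification steps are passed in as callables so that the stand-ins could be swapped for LLM-backed functions without touching the template logic. The augmented string, rather than the raw question, would then be handed to the retriever.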

Paper Structure (27 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: An illustration comparing our method with related works. We consider two types of methods: offline and online. On the upper-left, existing offline methods use LLMs to generate datasets for training. The upper-right shows our offline method, using LLMs to enhance the document database for the online phase. Online methods are depicted in the lower part of the figure. From lower-left to lower-right: Corrective RAG and Self-RAG modify the response of RAG after the document retrieval step. If the user's question is ambiguous or lacks context, RAG cannot retrieve the most relevant documents, limiting the effectiveness of these methods. Another approach deconstructs the question into an AST and synthesizes SQL queries accordingly, improving query fidelity but only applicable to SQL queries. Our method reflects upon the question, identifies its context, and augments the question by querying a jargon dictionary before document retrieval. The augmented question allows RAG to faithfully retrieve the most relevant documents despite ambiguous jargon or lack of explicit context.
  • Figure 2: Left: the workflow diagram of the online inference part of Golden-Retriever. Right: example interactions between the system and the LLM at the intermediate steps of the workflow. The system prompts LLM to generate intermediate responses, which are saved, accessed, and used for future steps in the workflow.
  • Figure 3: Illustration of document pre-processing and an example prompt implementation of the LLM-driven document augmentation process (see the section "LLM-driven Document Augmentation").