Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models

Fengchen Liu, Jordan Jung, Wei Feinstein, Jeff DAmbrogia, Gary Jung

TL;DR

This paper presents an approach to enhancing closed-domain Question Answering (QA) systems for the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain, with a detailed comparison of two fine-tuned large language models and five retrieval-augmented generation (RAG) models.

Abstract

This paper introduces a novel approach to enhancing closed-domain Question Answering (QA) systems, focusing on the specific needs of the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain. Using a dataset derived from the ScienceIT documentation, we compare in detail two fine-tuned large language models and five retrieval-augmented generation (RAG) models. We transform the documentation into structured context-question-answer triples, leveraging recent Large Language Models (AWS Bedrock, GCP PaLM2, Meta LLaMA2, OpenAI GPT-4, Google Gemini-Pro). Additionally, we introduce the Aggregated Knowledge Model (AKM), which synthesizes responses from the seven models mentioned above, using K-means clustering to select the most representative answers. Evaluating these models across multiple metrics gives a comprehensive view of their effectiveness and suitability for the LBL ScienceIT environment. The results demonstrate the potential benefits of integrating fine-tuning and retrieval-augmented strategies, and show significant performance improvements with the AKM. The insights gained from this study can be applied to develop specialized QA systems tailored to other domains.
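
The answer-selection step described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy bag-of-words `embed` function stands in for whatever sentence embedding the real system uses, the rule "take the largest cluster, then the answer nearest its centroid" is one plausible reading of "most representative", and the names `embed`, `kmeans`, and `select_representative`, the default `k=2`, and the example answers are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text, vocab):
    """Toy L2-normalised bag-of-words vector (assumed stand-in for a real embedding)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    k = min(k, len(points))
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def select_representative(answers, k=2):
    """Return the answer closest to the centroid of the largest cluster."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    vocab = sorted({w for a in answers for w in re.findall(r"[a-z0-9]+", a.lower())})
    points = [embed(a, vocab) for a in answers]
    labels, centroids = kmeans(points, k)
    biggest = Counter(labels).most_common(1)[0][0]
    members = [i for i, lab in enumerate(labels) if lab == biggest]
    best = min(members, key=lambda i: sq_dist(points[i], centroids[biggest]))
    return answers[best]


if __name__ == "__main__":
    # Made-up candidate answers, as if returned by several models.
    candidates = [
        "Use sbatch to submit a job",
        "Submit a job with sbatch",
        "Use sbatch to submit the job",
        "Reset your password online",
    ]
    print(select_representative(candidates, k=2))
```

Clustering before selecting means that an outlier answer (the last candidate above) forms its own small cluster and does not pull the chosen answer away from the majority. In practice a library implementation such as scikit-learn's `KMeans` over dense sentence embeddings would likely replace the hand-written loop.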

Paper Structure

This paper contains 22 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Workflow of the Enhanced ScienceIT QA System. This diagram illustrates the two phases of the QA system's development. Panel A shows the data preparation phase, where ScienceIT documents are split into chunks and processed to generate embeddings with AWS Bedrock, GCP PaLM, Meta LLaMA, and OpenAI GPT-4, facilitated by LangChain, creating a VectorStore for RAG (green) and training examples for model fine-tuning (blue). Panel B depicts the query processing flow, where a fine-tuned LLM answers a user’s question and the RAG pipeline retrieves relevant document chunks from the VectorStore using similarity search. The question, along with the retrieved chunks, is passed through various LLMs, including AWS Bedrock, GCP PaLM, Meta LLaMA (self-hosted), OpenAI GPT-4, and Google Gemini-Pro, to generate precise answers. The Aggregated Knowledge Model (AKM) further enhances the system by synthesizing answers from the fine-tuned and RAG models, using K-means clustering to select the most representative answer (red), improving overall accuracy and reliability.
  • Figure 2: Performance of the models across evaluation metrics. The figure compares eight models: two fine-tuned, five retrieval-augmented generation (RAG), and the Aggregated Knowledge Model (AKM). The models were evaluated using approximately 560 ScienceIT domain knowledge questions, with the process repeated 100 times. Metrics include BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-1 (Precision), ROUGE-1 (Recall), ROUGE-1 (F1), ROUGE-2 (Precision), ROUGE-2 (Recall), ROUGE-2 (F1), ROUGE-L (Precision), ROUGE-L (Recall), ROUGE-L (F1), and Semantic Textual Similarity (STS). Each bar represents the mean value of the metric for a model, with error bars indicating the standard deviation. The AKM aggregates responses from the other seven models using K-means clustering and shows improved performance across the metrics.
  • Figure 3: Distribution of metrics for each model. The plot shows the distributions of the performance metrics (BLEU, ROUGE, and STS) for eight models evaluated on approximately 560 ScienceIT domain knowledge questions. Each model was assessed 100 times, yielding a distribution for each metric. The red lines indicate the mean values. Model 1: GCP PaLM2 (text-bison-001) w/ Fine-tune. Model 2: GCP PaLM2 (text-bison-001) w/ context and Fine-tune. Model 3: AWS Bedrock (Titan & Claude-v2) w/ RAG. Model 4: GCP PaLM2 (text-bison-001) w/ RAG. Model 5: Meta LLaMA2 (13b-chat-Q5) w/ RAG. Model 6: OpenAI GPT-4 w/ RAG. Model 7: Google Gemini-Pro w/ RAG. Model 8: Aggregated Knowledge Model (AKM). STS: Semantic Textual Similarity.
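
To make the metrics named in the figure captions concrete, the sketch below computes ROUGE-1 precision, recall, and F1 for one candidate answer against one reference, and then the mean and standard deviation over repeated runs as plotted in Figure 2. It is a simplified illustration, not the paper's evaluation code: tokenisation is a plain lowercase regex with no stemming, the sample standard deviation is an assumption (the captions do not say which estimator was used), and the names `rouge1` and `summarize` are hypothetical.

```python
import re
from collections import Counter
from statistics import mean, stdev


def tokens(text):
    """Lowercase alphanumeric tokens (simplified; no stemming or stop-word handling)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between candidate and reference."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def summarize(scores):
    """Mean and sample standard deviation of one metric over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    p, r, f = rouge1("the cat sat", "the cat sat on the mat")
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

A short candidate that copies part of the reference scores high precision but lower recall, which is why the figures report precision, recall, and F1 separately.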