
Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider

Tina. J. Jat, T. Ghosh, Karthik Suresh

Abstract

To harness the power of language models for answering specialized, domain-specific technical questions, Retrieval-Augmented Generation (RAG) is widely used. In this work, we developed a Q&A application based on RAG, comprising an in-house database indexed on arXiv articles related to the Electron-Ion Collider (EIC), one of the largest international scientific collaborations, with an open-source LLaMA model incorporated for answer generation. This extends its preceding application, which was built on a proprietary model and a cloud-hosted external knowledge base for the EIC. The locally deployed RAG system offers a cost-effective alternative for building a RAG-assisted Q&A application under resource constraints to answer domain-specific queries in experimental nuclear physics. This setup also preserves data privacy, avoiding the transmission of any pre-publication scientific data and information to the public domain. Future improvements will expand the knowledge base to encompass heterogeneous EIC-related publications and reports, and upgrade the application's pipeline orchestration to the LangGraph framework.
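The retrieval flow the abstract describes (ingest and chunk articles, embed them, index, then retrieve top-matching chunks and build a grounded prompt) can be sketched in miniature. This is an illustrative stand-in only: the actual application uses dense sentence embeddings, a ChromaDB index, and a local LLaMA model, whereas here a bag-of-words vector and brute-force cosine search stand in for those components, and all function names are hypothetical.

```python
import math
from collections import Counter

def chunk(text, size=120, overlap=20):
    """Split text into fixed-size character chunks with overlap,
    mirroring the chunk-size/overlap parameters studied in the paper."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text):
    # Toy embedding: word-count vector (the real system uses dense embeddings).
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, chunks, k=2):
    # Brute-force similarity search standing in for a ChromaDB query.
    qv = embed(query)
    scored = [(cosine(qv, embed(c)), c) for c in chunks]
    return [c for _, c in sorted(scored, key=lambda t: -t[0])[:k]]

def build_prompt(query, contexts):
    # Merge retrieved chunks with the query before passing to the LLM.
    ctx = "\n".join(contexts)
    return f"Answer using only the context below.\nContext:\n{ctx}\nQuestion: {query}"
```

In the deployed system, `retrieve` would be replaced by a ChromaDB similarity query and `build_prompt`'s output passed to the LLaMA model for generation.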

Paper Structure

This paper contains 7 sections and 5 figures.

Figures (5)

  • Figure 1: The schematic of the Q&A system design for the EIC. The pipeline consists of article ingestion, document chunking, vector embedding, ChromaDB indexing, retrieval, prompt construction, and response generation. First, the query is encoded into a dense-vector representation; a similarity search is then performed between the embedded query and the locally deployed vectorized database to retrieve the most relevant contexts. These retrieved chunks are subsequently merged with the query through a carefully designed prompt template and passed to a language model, which finally generates a contextually grounded response.
  • Figure 2: Left: Retrieval latency of the RAG system for chunk sizes of 120 and 180 characters with an overlap of 20 characters between consecutive chunks. Right: Retrieval latency measured for two different similarity retrieval mechanisms: cosine similarity and MMR.
  • Figure 3: Latency distribution during answer generation of the RAG application for two different language models, LLaMA 3.2 and LLaMA 3.3, respectively.
  • Figure 4: The performance of the Q&A system as represented by the six RAGAS evaluation metrics for retrieval and answer generation. Upper panel: the evaluation metrics for a chunk size of 120 characters with cosine similarity; lower panel: the same metrics measured for the MMR retrieval technique with a 120-character chunk size.
  • Figure 5: The generation performance as characterized by the six metrics of the RAGAS framework for a chunk size of 180 characters, with cosine similarity (upper panel) and MMR retrieval (lower panel), respectively.
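Figures 2, 4, and 5 compare two retrieval mechanisms: plain cosine similarity and Maximal Marginal Relevance (MMR). MMR greedily selects chunks that balance relevance to the query against redundancy with chunks already chosen. The following is a minimal sketch of the standard MMR selection rule over pre-computed vectors, not the implementation used in the application; the trade-off parameter name `lam` is an assumption.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr(query_vec, doc_vecs, k=2, lam=0.5):
    """Greedily pick k document indices, trading off relevance to the
    query (first term) against redundancy with documents already
    selected (second term). lam=1 reduces to pure cosine retrieval."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a high `lam` the second pick stays close to the query; with a low `lam` it favors diversity, which is the behavior the latency and RAGAS comparisons in the figures probe.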