Leveraging Retrieval-Augmented Generation for Persian University Knowledge Retrieval

Arshia Hemmat, Kianoosh Vadaei, Mohammad Hassan Heydari, Afsaneh Fatemi

TL;DR

This work presents a Retrieval-Augmented Generation (RAG) pipeline tailored for Persian university knowledge retrieval, combining a two-stage RAG framework with a Persian Llama-3 derivative named DORNA and a university-specific dataset called UniversityQuestionBench (UQB). The dataset is built through web scraping, student surveys, and GPT-4 augmentation. The system is evaluated with three RAGAS metrics (Faithfulness, Answer Relevance, and Context Relevance) and shows improvements in grounding and relevance over baselines. The study demonstrates that a domain-adapted RAG approach can substantially improve accuracy and user satisfaction for Persian-language academic QA, supported by both quantitative metrics and qualitative human judgments. The work also outlines practical directions for extending dataset diversity, integrating data across universities, and updating information in real time to stay relevant in dynamic academic environments.

Abstract

This paper introduces an innovative approach using Retrieval-Augmented Generation (RAG) pipelines with Large Language Models (LLMs) to enhance information retrieval and query response systems for university-related question answering. By systematically extracting data from the university's official webpage and employing advanced prompt engineering techniques, we generate accurate, contextually relevant responses to user queries. We developed a comprehensive university benchmark, UniversityQuestionBench (UQB), to rigorously evaluate our system's performance, assessing accuracy and reliability in real-world scenarios using key metrics common in the field of RAG pipelines. Our experimental results demonstrate significant improvements in the precision and relevance of generated responses, enhancing user experience and reducing the time required to obtain relevant answers. In summary, this paper presents a novel application of RAG pipelines and LLMs, supported by a meticulously prepared university benchmark, offering valuable insights into advanced AI techniques for academic data retrieval and setting the stage for future research in this domain.

Paper Structure

This paper contains 24 sections, 16 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Our Proposed Pipeline
  • Figure 2: Data Generation Procedure - This figure shows the question and answer generation process.
  • Figure 3: QA samples