Evaluation of RAG Metrics for Question Answering in the Telecom Domain

Sujoy Roychowdhury; Sumit Soman; H G Ranjani; Neeraj Gunda; Vansh Chhabra; Sai Krishna Bala

TL;DR

This work tackles the challenge of evaluating retrieval-augmented QA in the telecom domain by extending the RAGAS evaluation framework to expose the intermediate outputs of its prompts. It systematically studies how retriever embeddings (including domain-adapted variants) and generator instruction tuning affect RAGAS metrics, using TeleQuAD data derived from 3GPP Release 15 and SME judgments as ground truth. The findings indicate that Factual Correctness ($FacCor$) and Faithfulness ($FaiFul$) align best with expert evaluations and that domain adaptation improves concordance, while metrics such as Answer Relevance ($AnsRel$) and Context Relevance ($ConRel$) can be less reliable and harder to interpret. These insights inform end-to-end telecom QA deployments and highlight directions for broader evaluation across domains and libraries.

Abstract

Retrieval Augmented Generation (RAG) is widely used to enable Large Language Models (LLMs) to perform Question Answering (QA) tasks in various domains. However, evaluating the responses generated by RAG systems built on open-source LLMs remains a challenge in specialized domains. A popular framework in the literature is RAG Assessment (RAGAS), a publicly available library that uses LLMs for evaluation. One disadvantage of RAGAS is that it does not detail how the numerical values of its evaluation metrics are derived. One outcome of this work is a modified version of this package for a few metrics (faithfulness, context relevance, answer relevance, answer correctness, answer similarity and factual correctness), which exposes the intermediate outputs of the prompts and can be used with any LLM. Next, we analyse expert evaluations of the output of the modified RAGAS package and observe the challenges of using it in the telecom domain. We also study the behaviour of the metrics under correct vs. wrong retrieval and observe that a few of the metrics have higher values for correct retrieval. We further study differences in the metrics between base embeddings and those domain-adapted via pre-training and fine-tuning. Finally, we comment on the suitability and challenges of using these metrics for in-the-wild telecom QA tasks.

Paper Structure (17 sections, 8 equations, 3 figures, 4 tables)

Introduction
Research Questions
Experimental Setup
Dataset
Retriever Models
Generator
RAG Evaluation
Results and Discussion
Discussion on Metrics
Conclusions and Future Work
Computation of RAGAS Metrics
Faithfulness ($FaiFul$)
Answer Relevance ($AnsRel$)
Context Relevance ($ConRel$)
Answer Similarity ($AnsSim$)
...and 2 more sections

Figures (3)

Figure 1: Schematic showing our experimental setup. Dotted arrows indicate that the retriever and generator are evaluated with both the base and domain adapted variants.
Figure 2: Summary view of RAGAS Metrics and their computation. Green check mark indicates recommended metrics, based on our experiments.
Figure 3: Sample Questions
