Evaluating the Retrieval Component in LLM-Based Question Answering Systems

Ashkan Alinejad, Krtin Kumar, Ali Vahdat

TL;DR

The paper tackles the challenge of evaluating the retriever in Retrieval-Augmented Generation QA systems, where LLMs may generate correct answers even with imperfect retrieval. It introduces LLM-retEval, an end-to-end evaluation framework that compares QA outputs when the system uses retrieved documents versus an ideal set of gold documents, thereby aligning retriever assessment with downstream QA performance and mitigating the shortcomings of traditional precision/recall metrics. Using the NQ-open dataset, dense retrieval with e5-large-v2, and generators GPT-4 and ChatGPT-Turbo, the authors show that LLM-retEval correlates with overall QA outcomes and is affected by issues such as distractors and annotation gaps, which conventional metrics tend to mishandle. The results indicate GPT-4 generally yields stronger alignment with recall signals than ChatGPT-Turbo, and the framework provides a practical, robust approach for end-to-end retriever evaluation with potential impact on developing more reliable RAG systems.

Abstract

Question answering (QA) systems that use Large Language Models (LLMs) depend heavily on the retrieval component to supply domain-specific information and to reduce the risk of inaccurate responses or hallucinations. Although retriever evaluation dates back to early research in Information Retrieval, assessing retriever performance within LLM-based chatbots remains a challenge. This study proposes a straightforward baseline for evaluating retrievers in Retrieval-Augmented Generation (RAG)-based chatbots. Our findings demonstrate that this evaluation framework gives a clearer picture of how the retriever performs and is better aligned with the overall performance of the QA system. Conventional metrics such as precision, recall, and F1 score may not fully capture LLMs' capabilities, since LLMs can yield accurate responses despite imperfect retrievers; our method, by contrast, accounts for LLMs' ability to ignore irrelevant contexts, as well as for potential errors and hallucinations in their responses.
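The comparison described in the TL;DR (answering once with retrieved documents, once with gold documents, then checking whether the two answers agree) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Example`, `llm_ret_eval`, `toy_generate`, and `toy_judge` are hypothetical, the generator and judge are plain callables standing in for LLM calls, and aggregating per-question agreement into a simple fraction is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    """One QA item; field names are illustrative, not taken from the paper."""
    question: str
    retrieved_docs: Sequence[str]  # what the retriever returned
    gold_docs: Sequence[str]       # the ideal (annotated) documents


# Hypothetical interfaces: in practice both would wrap LLM calls.
Generator = Callable[[str, Sequence[str]], str]
Judge = Callable[[str, str, str], bool]


def llm_ret_eval(examples: Sequence[Example],
                 generate: Generator,
                 judge: Judge) -> float:
    """Fraction of questions where the answer from retrieved documents
    agrees with the answer from gold documents (assumed aggregation)."""
    if not examples:
        raise ValueError("need at least one example")
    agreements = 0
    for ex in examples:
        answer_retrieved = generate(ex.question, ex.retrieved_docs)
        answer_gold = generate(ex.question, ex.gold_docs)
        # The judge decides whether the two answers are equivalent.
        if judge(ex.question, answer_gold, answer_retrieved):
            agreements += 1
    return agreements / len(examples)


def toy_generate(question: str, docs: Sequence[str]) -> str:
    # Stand-in for an LLM: answer with the first document, if any.
    return docs[0] if docs else "I don't know"


def toy_judge(question: str, reference: str, candidate: str) -> bool:
    # Stand-in for an LLM judge: case-insensitive exact match.
    return reference.strip().lower() == candidate.strip().lower()


examples = [
    # An extra distractor document does not change the answer: agreement.
    Example("Capital of France?", ["Paris", "Lyon"], ["Paris"]),
    # The retriever missed the gold document: disagreement.
    Example("Capital of Austria?", ["Berlin"], ["Vienna"]),
]
score = llm_ret_eval(examples, toy_generate, toy_judge)
print(score)
```

With the toy stand-ins, the first question counts as a retriever success even though a distractor was retrieved, which is the kind of case where precision alone would penalize the retriever. Replacing `toy_generate` and `toy_judge` with real model calls is left to the reader.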

Paper Structure

This paper contains 13 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Qualitative example cases where conventional metrics fail, along with LLM-retEval scores. The text in green is the correct answer to the question.
  • Figure 2: The prompt template used for answer generation.
  • Figure 3: The prompt template based on GPT-Eval for comparing LLM responses.