Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers

Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke

TL;DR

The paper tackles the challenge of evaluating answers that large language models generate without citing external sources. It adapts information retrieval benchmarks into an evaluation framework in two ways: by measuring the embedding-based similarity between generated answers (and retrieved passages) and ground-truth relevant content, inspired by BERTScore, and by comparing generated outputs to the top results from diverse retrieval models. Through experiments on the MS MARCO and TREC DL datasets, it shows that the approach correlates with traditional metrics such as nDCG@10, works even without relevance judgments, and reveals that models like GPT-4 can rival retrieval-based methods in this evaluation space. This work provides a practical, scalable bridge between generation and retrieval evaluation, enabling fair comparisons across models and prompts and guiding the development of better evaluation strategies for generative QA systems.

Abstract

Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers is rarely directly compared to the quality of retrieved answers. As models evolve and prompts are modified, we have no systematic way to measure improvements without resorting to expensive human judgments. To address this problem, we adapt standard retrieval benchmarks to evaluate answers generated by large language models. Inspired by the BERTScore metric for summarization, we explore two approaches. In the first, we base our evaluation on the benchmark relevance judgments. We empirically examine how information retrieval relevance judgments can be used as an anchor for evaluating the generated answers. In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, allowing us to measure improvements without human judgments. In both cases, we measure the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage from the retrieval benchmark.
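
The similarity measurement described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words stand-in for whatever neural encoder the paper uses, cosine similarity is an assumed choice of similarity function, and taking the maximum over reference passages in the hypothetical `answer_score` helper is an assumed aggregation that the abstract does not specify.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a neural encoder (assumption): lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_score(answer, reference_passages):
    """Score an answer by its best match among the reference passages.

    The references can be passages judged relevant (first approach) or the
    top results of a retrieval model (second approach). Using the maximum
    is an assumption made for this sketch.
    """
    answer_vec = embed(answer)
    return max(
        (cosine(answer_vec, embed(passage)) for passage in reference_passages),
        default=0.0,
    )


if __name__ == "__main__":
    # Hypothetical example texts, not taken from any benchmark.
    references = [
        "Paris is the capital of France.",
        "France is a country in Western Europe.",
    ]
    generated = "The capital of France is Paris."
    print(answer_score(generated, references))
```

Swapping `embed` for a real sentence encoder would leave the rest of the sketch unchanged, which is why the encoder is isolated in its own function.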

Paper Structure

This paper contains 11 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Prompts used in different settings for generating the answers.
  • Figure 2: Distribution of similarities between qrels in different levels of relevance on DL 2019 and DL 2020. The mean and median of each distribution are shown with a $\times$ and a horizontal line in the boxes.
  • Figure 3: The similarity of the generated models' responses on TREC DL 2019 and 2020 to judged passages at different relevance levels.
  • Figure 4: The performance of all runs submitted to TREC DL in 2019 and 2020, as well as the performance of our generated models on these datasets. The submitted runs are plotted using both nDCG@10 and the similarity score, whereas the gray area shows only the similarity score for the generated runs. The nDCG@10 metric is not applicable to the colored points. The Kendall $\tau$ correlation between nDCG@10 and the similarity of retrieved results with the qrels (see the section on assessment with qrels) is shown above each sub-figure.
  • Figure 5: The performance of the generated models on the MS MARCO dev set, based on similarity with the top retrieved documents from different retrieval methods.
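
The Kendall $\tau$ correlation mentioned in the Figure 4 caption compares how two metrics rank the same set of runs. A minimal sketch, assuming the tau-a variant (the caption does not say which variant the paper uses) and using made-up placeholder scores rather than any results from the paper:

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant. The choice of
    tau-a is an assumption made for this sketch.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical per-run scores, for illustration only.
    ndcg_at_10 = [0.71, 0.65, 0.58, 0.49]
    similarity = [0.83, 0.80, 0.81, 0.74]
    print(kendall_tau(ndcg_at_10, similarity))
```

A value near 1 would mean the similarity score orders the runs almost the same way nDCG@10 does; a value near 0 would mean the two orderings are largely unrelated.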