RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Joao Monteiro; Pierre-Andre Noel; Etienne Marcotte; Sai Rajeswar; Valentina Zantedeschi; David Vazquez; Nicolas Chapados; Christopher Pal; Perouz Taslakian

TL;DR

A large-scale benchmark comprising several state-of-the-art LLMs is run to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting.

Abstract

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
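The four components of a sample listed above can be pictured as a small record, together with the context-conditional prompt built from it. The following is a minimal sketch under stated assumptions: the class name, field names, prompt wording, and toy sample are all hypothetical illustrations and are not the column names or contents of the released dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepliqaSample:
    """One sample, mirroring the four components listed in the abstract.

    Field names are hypothetical, not the released dataset's column names.
    """
    document: str              # (1) human-written reference document
    question: str              # (2) question about the document
    answer: str                # (3) ground-truth answer derived from the document
    supporting_paragraph: str  # (4) paragraph of the document containing the answer


def build_prompt(sample: RepliqaSample) -> str:
    """Assemble a context-conditional prompt: the document first, then the question.

    The wording of the instruction is an assumption for illustration only.
    """
    return (
        "Answer using only the document below.\n\n"
        f"Document:\n{sample.document}\n\n"
        f"Question: {sample.question}\n"
        "Answer:"
    )


def paragraph_in_document(sample: RepliqaSample) -> bool:
    """Sanity check: the supporting paragraph should be an extract of the document."""
    return sample.supporting_paragraph in sample.document


# Made-up toy sample (not taken from RepLiQA) describing an imaginary event.
sample = RepliqaSample(
    document=(
        "The town of Eldoria opened a new library on Tuesday.\n\n"
        "Mayor Ana Ruiz cut the ribbon."
    ),
    question="Who cut the ribbon at the library opening?",
    answer="Mayor Ana Ruiz",
    supporting_paragraph="Mayor Ana Ruiz cut the ribbon.",
)
prompt = build_prompt(sample)
```

Because the document describes an imaginary scenario, a model can only produce the ground-truth answer by reading the supplied document, which is why the prompt always carries the full reference text.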

Paper Structure (15 sections, 10 figures, 2 tables)

This paper contains 15 sections, 10 figures, 2 tables.

Introduction
How to benchmark with RepLiQA
Creating RepLiQA's Content and Annotations
Content Creators and Annotators
Dataset Creation by the Vendor
Dataset Finalization by the Authors
Post-processing and Assemblage
Release Schedule and Potential Leaks
Maintenance
Benchmarking LLMs with RepLiQA on Reading Comprehension
Question-Answering
Effect of Scaling Model Size
Testing The Ability of Models to Admit Lack of Knowledge
Topic Retrieval
Conclusion

Figures (10)

Figure 1: Creating RepLiQA. RepLiQA0 was exposed to the web in May 2024 through online LLM inference experiments. See the dataset statistics table for the release schedule of the remaining splits.
Figure 2: A sample from RepLiQA showing the topic, an excerpt from the supporting document, and a question-answer pair.
Figure 3: Topics of the RepLiQA0 reference documents, with their occurrence counts in parentheses.
Figure 4: (top) Recall of various models on question answering for RepLiQA0 and TriviaQA. (bottom) Difference in recall on question answering between RepLiQA0 and TriviaQA.
Figure 5: Side-by-side performance of each model on RepLiQA and TriviaQA.
...and 5 more figures
