DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes

Jiehan Cheng, Zhicheng Dou

TL;DR

DailyQA presents a dynamically updated benchmark that assesses LLMs’ ability to handle fast-changing factual information by leveraging weekly Wikipedia revision diffs and an automated data pipeline. It demonstrates that retrieval-augmented methods, particularly with document reranking, are essential but insufficient for time-sensitive tasks, and that model scale improves performance on time-aware evaluation metrics. The study provides insights into cross-domain performance and the challenges of integrating temporal information into RAG systems, offering a foundation for future improvements in time-aware QA. Overall, DailyQA offers a valuable, scalable testbed for advancing LLMs’ adaptation to real-world, time-varying knowledge.

Abstract

We propose DailyQA, an automatically updated dynamic dataset that refreshes its questions weekly and provides the answer to each question for any given date. DailyQA uses daily updates from Wikipedia revision logs to drive a fully automated pipeline of data filtering, query generation, quality checking, answer extraction, and query classification. The benchmark requires large language models (LLMs) to process and answer questions that involve fast-changing factual data and cover multiple domains. We evaluate several open-source and closed-source LLMs using different RAG pipelines with web search augmentation. We compare the ability of different models to process time-sensitive web information and find that reranking web retrieval results is critical. Our results indicate that LLMs still face significant challenges in handling frequently updated information, suggesting that the DailyQA benchmark provides valuable insights into the direction of progress for LLMs and RAG systems.
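The idea of a question that "contains answers on any given date" can be illustrated with a small data structure. The following is a minimal sketch, not the authors' schema: the class name `DatedQuery`, its fields, and the step-function interpretation (an answer stays in effect until the next recorded change) are all assumptions made for illustration, and the values in the usage note are placeholders.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatedQuery:
    """A question whose correct answer depends on the date (hypothetical schema)."""

    question: str
    # (effective_date, answer) pairs sorted by date. Assumption: each answer
    # stays in effect from its date until the next entry's date.
    history: list[tuple[date, str]] = field(default_factory=list)

    def answer_on(self, day: date) -> str | None:
        """Return the answer in effect on `day`, or None if `day` precedes all entries."""
        dates = [d for d, _ in self.history]
        idx = bisect_right(dates, day)  # number of entries dated on or before `day`
        return self.history[idx - 1][1] if idx else None
```

With placeholder values, `DatedQuery("q", [(date(2025, 1, 12), "a1"), (date(2025, 1, 14), "a2")]).answer_on(date(2025, 1, 13))` returns `"a1"`, because the first answer is still in effect on that day. A time-aware evaluation could then compare a model's output against `answer_on(query_date)` rather than against a single static gold answer.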

Paper Structure

This paper contains 24 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An example from DailyQA. The answer to “LeBron James' career total points” can change every day. For each query in DailyQA, we provide an answer for each day.
  • Figure 2: Overview of our DailyQA dataset construction pipeline, which includes filtering and processing of the raw data (Wiki revision logs), question generation, quality check, answer extraction, and query classification modules. In the quality check module, we check the correctness and descriptiveness of the queries. In the classification module, we classify queries based on their update frequency and domains.
  • Figure 3: The number of answer changes relative to the previous day. For example, on the line with a start date of 2025/01/12, the “+1” position on the horizontal axis indicates that, in the corresponding dataset, about 1,200 answers for 2025/01/13 changed relative to the previous day.
  • Figure 4: Percentage of queries by number of answer changes. For example, as shown in the left bar, in the query dataset for 2025/01/12-2025/01/18, about 70% of queries have an answer that changes once. Note that, consistent with Figure 3, we count answer changes over a three-week period that includes the week before and the week after.
  • Figure 5: Distribution of the queries across domains. In the labels, "W-2025-01-12", for example, means that the query update corresponds to the week starting on 2025-01-12.
  • ...and 4 more figures
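
The statistics described in the captions of Figures 3 and 4 (answer changes relative to the previous day, and the share of queries by number of changes) can be sketched as follows. This is an illustrative reading of the captions, not the authors' code: the function names, the `answers` layout (query id mapped to a per-date answer), and the rule of skipping days with missing answers are assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def count_daily_changes(
    answers: dict[str, dict[date, str]], start: date, num_days: int
) -> list[int]:
    """For day offsets +1 .. +(num_days - 1), count queries whose answer
    differs from the previous day's answer (Figure 3 style)."""
    counts = []
    for offset in range(1, num_days):
        today = start + timedelta(days=offset)
        yesterday = today - timedelta(days=1)
        changed = sum(
            1
            for per_day in answers.values()
            # Assumption: days without a recorded answer are skipped.
            if today in per_day
            and yesterday in per_day
            and per_day[today] != per_day[yesterday]
        )
        counts.append(changed)
    return counts


def change_count_shares(
    answers: dict[str, dict[date, str]], start: date, num_days: int
) -> dict[int, float]:
    """Share of queries by how many times their answer changed in the
    window (Figure 4 style)."""
    per_query = Counter()
    for per_day in answers.values():
        n_changes = 0
        for offset in range(1, num_days):
            today = start + timedelta(days=offset)
            yesterday = today - timedelta(days=1)
            if (
                today in per_day
                and yesterday in per_day
                and per_day[today] != per_day[yesterday]
            ):
                n_changes += 1
        per_query[n_changes] += 1
    total = sum(per_query.values())
    return {k: v / total for k, v in per_query.items()} if total else {}
```

Setting `num_days=21` would correspond to the three-week window mentioned in the Figure 4 caption, with `start` placed one week before the dataset's own week.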