DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes
Jiehan Cheng, Zhicheng Dou
TL;DR
DailyQA is a dynamically updated benchmark that assesses LLMs’ ability to handle fast-changing factual information. It is built from daily Wikipedia revision logs through an automated data pipeline, and its questions are refreshed weekly. The study shows that retrieval-augmented methods, particularly those with document reranking, are essential but insufficient for time-sensitive tasks, and that model scale improves performance on time-aware evaluation metrics. It also offers insights into cross-domain performance and the challenges of integrating temporal information into RAG systems, providing a foundation for future improvements in time-aware QA. Overall, DailyQA is a scalable testbed for advancing LLMs’ adaptation to real-world, time-varying knowledge.
Abstract
We propose DailyQA, an automatically updated dynamic dataset that refreshes its questions weekly and provides the answer to each question as of any given date. DailyQA uses daily updates from Wikipedia revision logs to implement a fully automated pipeline of data filtering, query generation, quality checking, answer extraction, and query classification. The benchmark requires large language models (LLMs) to process and answer questions that involve fast-changing factual data and cover multiple domains. We evaluate several open-source and closed-source LLMs using different RAG pipelines with web search augmentation. We compare the ability of different models to process time-sensitive web information and find that reranking web retrieval results is critical. Our results indicate that LLMs still face significant challenges in handling frequently updated information, suggesting that DailyQA provides valuable insights into the direction of progress for LLMs and RAG systems.
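To make the idea of "answers to questions on any given date" concrete, the following is a minimal sketch of how a date-indexed answer record could be stored and queried. It is an illustration only, not the authors' implementation: the class `DatedQuestion`, its fields, and the methods `record` and `answer_on` are hypothetical names, and the sketch assumes each revision simply replaces the previous answer from its date onward.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DatedQuestion:
    """Hypothetical record: one question plus its answers over time."""
    text: str
    # Parallel lists kept in chronological order (assumption of this sketch):
    # valid_from[i] is the day answers[i] came into effect.
    valid_from: List[date] = field(default_factory=list)
    answers: List[str] = field(default_factory=list)

    def record(self, day: date, answer: str) -> None:
        """Store the answer observed in a revision dated `day`."""
        if self.valid_from and day < self.valid_from[-1]:
            raise ValueError("revisions must be recorded in chronological order")
        self.valid_from.append(day)
        self.answers.append(answer)

    def answer_on(self, day: date) -> Optional[str]:
        """Return the answer in effect on `day`, or None if none exists yet."""
        # bisect_right counts the revisions dated on or before `day`.
        idx = bisect_right(self.valid_from, day)
        return self.answers[idx - 1] if idx else None
```

With such a record, a query for a date between two revisions returns the earlier revision's answer, and a query for a date before the first revision returns `None`. The placeholder values used to exercise the sketch (for example `"A"` and `"B"`) are not drawn from the dataset.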
