OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking

Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, Jiawei Zhou

TL;DR

This paper introduces OKBench, a fully automated, on-demand framework for dynamic knowledge benchmarking in the news domain to evaluate LLMs and retrieval-augmented methods amid rapidly evolving facts. It builds an agentic pipeline that extracts daily news, generates and validates QA items, and versions datasets to enable reproducible, contamination-resistant evaluation. Empirical results show a substantial drop in accuracy when models rely on parametric memory for fresh knowledge, while retrieval markedly narrows the gap between small and large models and enhances end-to-end QA performance. The work highlights the importance of evaluating LLMs on evolving knowledge and demonstrates a practical, decentralized approach to continuous benchmarking with transparent versioning and open-source tooling.

Abstract

Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain, where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range of open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.
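To make the validation and versioning steps more concrete, the following is a minimal sketch, not the authors' implementation. It assumes multiple-choice items and a date-stamped snapshot with a content hash. The names `QAItem`, `is_valid`, and `build_snapshot`, and the specific checks, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, asdict
from datetime import date
import hashlib
import json


@dataclass(frozen=True)
class QAItem:
    """Hypothetical record for one multiple-choice question."""
    question: str
    choices: tuple
    answer_index: int
    source_url: str


def is_valid(item: QAItem) -> bool:
    """Cheap structural checks standing in for a real validation stage."""
    return (
        bool(item.question.strip())                       # non-empty question
        and len(item.choices) >= 2                        # at least two options
        and len(set(item.choices)) == len(item.choices)   # no duplicate options
        and 0 <= item.answer_index < len(item.choices)    # answer is in range
    )


def build_snapshot(items, day: date) -> dict:
    """Keep valid items and tag the result with a date and a content hash."""
    kept = [asdict(item) for item in items if is_valid(item)]
    # Serialize deterministically so the same items always hash the same way.
    payload = json.dumps(kept, sort_keys=True).encode("utf-8")
    return {
        "version": day.isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "items": kept,
    }
```

Under these assumptions, two runs over the same validated items on the same day yield the same hash, so an evaluation can be tied to one exact snapshot of the benchmark.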

Paper Structure

This paper contains 42 sections, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Automated dynamic knowledge benchmark construction pipeline for OKBench.
  • Figure 2: Top-$k$ Retrieval Accuracy for BM25, DPR, and ColBERT v2 across news corpora of different time windows (1-day, 5-day, and 10-day).
  • Figure 3: No-Context vs. Oracle-Context QA Accuracy on OKBench, plotted alongside each model's performance on MMLU Pro (lighter lines) as a reference for memorized knowledge. We show three representative model families (Gemma, Llama, Qwen) at various parameter scales (in billions of parameters). Solid lines denote No-Context accuracy (fresh knowledge), and dashed lines denote Oracle-Context accuracy when the ground-truth article is provided.
  • Figure 4: QA Accuracy with Retrieval-Augmented Context. Each panel shows QA accuracy (%) for four model families (Gemma, Llama, Qwen, Phi) across different parameter sizes, evaluated on dynamically generated news questions. Lines indicate performance with context retrieved by BM25 (dashed red), ColBERT v2 (solid gray), DPR (dash-dotted cyan), the Oracle (dashed blue, upper bound), and No-Context (solid yellow, lower bound) using top-3 retrieved passages from the 1-day news corpus.
  • Figure 5: Instructions page of the Google Form used for the Question Quality Check. We ask our human annotators to assess clarity and freshness of the multiple-choice questions, based on the provided instructions.
  • ...and 8 more figures
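
The top-$k$ retrieval accuracy named in Figure 2 can be illustrated with a short sketch. It assumes that a question counts as a hit when its source article appears among the first $k$ retrieved passages; the paper's exact definition may differ. The function name `top_k_accuracy` and its input layout are hypothetical.

```python
def top_k_accuracy(ranked_ids, gold_ids, k):
    """Fraction of questions whose gold article is in the top-k results.

    ranked_ids: one ranked list of article ids per question (best first).
    gold_ids:   the id of the source article for each question.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need one ranking per question")
    if not gold_ids:
        return 0.0
    # A question is a hit if its gold id appears in the first k ranked ids.
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


# Toy rankings for two questions; the ids are made up for illustration.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
gold = ["b", "z"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, gold, k))
```

The same function applies unchanged to any retriever's output, so rankings from BM25, DPR, or ColBERT v2 could be compared on a common footing.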