DRAGOn: Designing RAG On Periodically Updated Corpus

Fedor Chernogorskii, Sergei Averkiev, Liliya Kudraleeva, Zaven Martirosian, Maria Tikhonova, Valentin Malykh, Alena Fenogenova

TL;DR

DRAGOn presents a dynamic, periodically updated RAG benchmark built on a Russian news corpus to reflect real-world deployment and prevent data leakage. It combines a knowledge-graph-based QA generation pipeline, rigorous multi-stage QA filtering, and LLM-based judgments (via POLLUX) to curate high-quality QA pairs, culminating in a public, versioned leaderboard and open-source evaluation framework. The approach emphasizes reproducibility, modularity, and community-driven progress, with sandbox datasets for local validation and a scalable validation portal for secure evaluation on private ground-truth data. Overall, DRAGOn advances dynamic RAG evaluation by providing standardized workflows, transparent versioning, and a roadmap for multilingual and domain-variant extensions, enabling robust benchmarking of retriever-generator systems in evolving information landscapes.

Abstract

This paper introduces DRAGOn, a method for designing a RAG benchmark on a regularly updated corpus. It features recent reference datasets, a question generation framework, an automatic evaluation pipeline, and a public leaderboard. The fixed reference datasets allow for uniform comparison of RAG systems, while newly generated dataset versions mitigate data leakage and ensure that all models are evaluated on unseen, comparable data. The pipeline for automatic question generation extracts a Knowledge Graph from the text corpus and uses LLMs to produce multiple question-answer pairs. A set of diverse LLM-as-Judge metrics is provided for comprehensive model evaluation. We used Russian news outlets to form the datasets and demonstrate our methodology. We launch a public leaderboard to track the development of RAG systems and encourage community participation.

Paper Structure

This paper contains 60 sections, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: The DRAGOn logo.
  • Figure 2: Architecture of the benchmark system based on DRAGOn. All datasets are versioned and uploaded to Hugging Face with incrementally updated revision numbers. This versioning mechanism ensures reproducibility and provides users with stable snapshots for further experimentation.
  • Figure 3: Architecture of the Data Generation pipeline. Before KG extraction starts, we deduplicate the data, because the news dump can contain multiple edited versions of the same article. For each URL, we keep only the latest version of the text. We also extract named entities for later question filtering.
  • Figure 4: Leaderboard interface.
  • Figure 5: Human evaluation system interface.
  • ...and 1 more figure
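
The deduplication step described in the Figure 3 caption (keep only the latest version of each article with the same URL) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Article` record, its `updated_at` field, and the `keep_latest_per_url` function are hypothetical names, and the sketch assumes every article version carries a timestamp that identifies the latest edit.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one version of a news article."""
    url: str
    text: str
    updated_at: datetime  # assumed field: when this version was published or edited


def keep_latest_per_url(articles: Iterable[Article]) -> list[Article]:
    """Return one article per URL, keeping the version with the newest timestamp."""
    latest: dict[str, Article] = {}
    for article in articles:
        current = latest.get(article.url)
        # Keep this version if the URL is new or the version is strictly newer.
        if current is None or article.updated_at > current.updated_at:
            latest[article.url] = article
    # Dict insertion order preserves the order in which URLs were first seen.
    return list(latest.values())


# Example with made-up data: two versions of one article plus a distinct one.
dump = [
    Article("https://example.com/a", "first draft", datetime(2024, 1, 1, 9, 0)),
    Article("https://example.com/b", "other story", datetime(2024, 1, 1, 10, 0)),
    Article("https://example.com/a", "edited text", datetime(2024, 1, 1, 12, 0)),
]
deduplicated = keep_latest_per_url(dump)
```

Here `deduplicated` holds two articles, and the entry for `https://example.com/a` is the edited version. Keying on the URL alone follows the caption; a real pipeline might also need to normalize URLs (tracking parameters, trailing slashes) before comparing them.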