Overview of the TREC 2025 RAGTIME Track

Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Eugene Yang, Andrew Yates

TL;DR

This paper presents the inaugural RAGTIME track at TREC, which focuses on retrieval-augmented, multilingual long-form report generation across Arabic, Chinese, English, and Russian. It defines three tasks (Multilingual Report Generation, Monolingual English Report Generation, and Multilingual Information Retrieval) along with a shared document collection and an evaluation framework that builds on prior evaluation work and ARGUE-based assessment. The assessment workflow comprises four phases (document relevance, nugget creation, citation assessment, and nugget matching), complemented by automatic evaluation using AutoARGUE and a dedicated retrieval service; development data from NeuCLIR 2024 supports task calibration. Results from 13 teams and 125 runs show strong sentence grounding but relatively lower nugget coverage, underscoring the influence of retrieval and the need for better nugget capture. An Autonuggetization task is planned for 2026. Overall, RAGTIME establishes a foundation for multilingual RAG research and reusable evaluation resources, pointing to continued development and broader adoption in future years.

Abstract

The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted across the three tasks by 13 participating teams (and as baselines by the track coordinators). This overview describes the three tasks and presents the available results.

Paper Structure (26 sections, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Average citation overlap between all run pairs over all main-task topics (1001 to 1122). A white cell indicates that no valid topic ID was found in both runs of the pair.
  • Figure 2: Report generation results on 14 topics scored by automatic evaluation. Each bar represents one submission and is colored by the submitting team. Runs marked with circles were submitted by teams that include at least one track coordinator.
  • Figure 3: Box plot of the metric values for each topic. Topics are sorted by median F1 score.