Auto-ARGUE: LLM-Based Report Generation Evaluation
William Walden, Marc Mason, Orion Weller, Laura Dietz, John Conroy, Neil Molino, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, Dawn Lawrie, James Mayfield, Eugene Yang
TL;DR
The paper addresses the lack of automatic evaluation tools built specifically for report generation (RG), the task of producing long-form, citation-backed reports that reflect user-specific needs. It introduces Auto-ARGUE, an LLM-based implementation of the ARGUE framework, plus ARGUE-viz for visualization, and demonstrates their use on the TREC 2024 NeuCLIR RG pilot, evaluating 51 runs across 21 topics with 10–20 nuggets per topic. Auto-ARGUE uses a few-shot prompted LLM judge to assess sentence-level content and citations, computes nugget recall and sentence precision, and reports F1-based aggregates. The code is open source and includes a Streamlit frontend. System rankings produced by Auto-ARGUE agree strongly with those derived from human judgments, which supports its use for scalable automatic RG evaluation.
Abstract
Generation of long-form, citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation. We present an analysis of Auto-ARGUE on the RG pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualizing Auto-ARGUE outputs.
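
The TL;DR names three quantities: sentence precision, nugget recall, and an F1-based aggregate. The sketch below illustrates how such scores could be combined for a single report on a single topic. It is a simplified, unweighted illustration and not Auto-ARGUE's actual implementation: the class `ReportJudgments`, its fields, and the function names are hypothetical, and the boolean flags stand in for whatever verdicts the LLM judge returns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReportJudgments:
    """Hypothetical container for per-report judge verdicts (assumed shape)."""
    # One flag per report sentence: True if the judge accepted the sentence
    # (for example, its content is supported by the documents it cites).
    sentence_ok: List[bool]
    # One flag per topic nugget: True if the report covers that nugget.
    nugget_covered: List[bool]


def sentence_precision(j: ReportJudgments) -> float:
    """Fraction of report sentences the judge accepted (unweighted)."""
    if not j.sentence_ok:
        return 0.0
    return sum(j.sentence_ok) / len(j.sentence_ok)


def nugget_recall(j: ReportJudgments) -> float:
    """Fraction of topic nuggets covered by the report (unweighted)."""
    if not j.nugget_covered:
        return 0.0
    return sum(j.nugget_covered) / len(j.nugget_covered)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    if total == 0.0:
        return 0.0
    return 2.0 * precision * recall / total


# Toy example: 4 sentences (3 accepted), 5 nuggets (2 covered).
example = ReportJudgments(
    sentence_ok=[True, True, False, True],
    nugget_covered=[True, False, True, False, False],
)
p = sentence_precision(example)
r = nugget_recall(example)
score = f1(p, r)
```

In this toy example, precision is 3/4 and recall is 2/5. A system-level score would then aggregate such per-report values across topics, for instance by averaging, though the aggregation shown in the paper may differ.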
