Auto-ARGUE: LLM-Based Report Generation Evaluation

William Walden, Marc Mason, Orion Weller, Laura Dietz, John Conroy, Neil Molino, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, Dawn Lawrie, James Mayfield, Eugene Yang

TL;DR

The paper addresses the lack of automatic evaluation tools tailored to report generation (RG), the task of producing long-form, citation-backed reports that reflect a specific user's needs. It introduces Auto-ARGUE, an LLM-based implementation of the ARGUE framework, along with ARGUE-viz for visualizing its outputs, and demonstrates both on the TREC 2024 NeuCLIR RG pilot, which comprises 51 runs across 21 topics with 10–20 nuggets per topic. Auto-ARGUE uses a few-shot prompted LLM judge to assess sentence-level content and citations, computes nugget recall and sentence precision, and reports F1-based aggregates. The code is open source and the visualization frontend is built with Streamlit. Auto-ARGUE's system rankings agree well with those derived from human judgments, which supports its use for scalable automatic RG evaluation.
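The metrics named in the TL;DR can be illustrated with a minimal sketch. It is not the Auto-ARGUE implementation: the `JudgedSentence` type, its fields, and the rule that only supported sentences contribute nuggets are assumptions made for illustration, and the F1 here is simply the harmonic mean of the two scores.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class JudgedSentence:
    """One report sentence with hypothetical LLM-judge verdicts attached."""
    text: str
    supported: bool                      # assumed verdict: do the citations support the sentence?
    nuggets: set[str] = field(default_factory=set)  # assumed: IDs of nuggets the sentence covers


def sentence_precision(sentences: list[JudgedSentence]) -> float:
    """Fraction of report sentences judged as supported."""
    if not sentences:
        return 0.0
    return sum(s.supported for s in sentences) / len(sentences)


def nugget_recall(sentences: list[JudgedSentence], topic_nuggets: set[str]) -> float:
    """Fraction of the topic's nuggets covered by supported sentences (assumed rule)."""
    if not topic_nuggets:
        return 0.0
    covered: set[str] = set()
    for s in sentences:
        if s.supported:
            covered |= s.nuggets & topic_nuggets
    return len(covered) / len(topic_nuggets)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical judged report: four sentences, three of them supported.
report = [
    JudgedSentence("s1", True, {"n1"}),
    JudgedSentence("s2", True, {"n2"}),
    JudgedSentence("s3", False, {"n3"}),  # unsupported, so n3 is not counted
    JudgedSentence("s4", True),
]
topic = {"n1", "n2", "n3", "n4"}

p = sentence_precision(report)   # 3 of 4 sentences supported
r = nugget_recall(report, topic)  # 2 of 4 nuggets covered
score = f1(p, r)
```

In this toy report the precision is 0.75, the recall is 0.5, and the resulting F1 is 0.6.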

Abstract

Generation of long-form, citation-backed reports is a primary use case for retrieval augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, ones tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation. We present analysis of Auto-ARGUE on the RG pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualization of Auto-ARGUE outputs.
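System-level correlation between an automatic metric and human judgments is commonly measured with a rank correlation over per-system scores. The sketch below uses Kendall's tau (the tau-a variant, with no tie correction) as one such measure. The abstract does not state which statistic the authors use, so this choice is an assumption, and the scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau_a(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1   # both metrics order the pair the same way
        elif product < 0:
            discordant += 1   # the metrics disagree on the pair
        # tied pairs (product == 0) count toward neither total
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical per-system scores for four systems A, B, C, D.
auto_scores = [0.62, 0.48, 0.71, 0.35]
human_scores = [0.58, 0.60, 0.66, 0.40]

tau = kendall_tau_a(auto_scores, human_scores)
```

Here the two metrics disagree only on the relative order of systems A and B, so five of the six pairs are concordant and one is discordant, giving a tau of 4/6.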

Paper Structure

This paper contains 9 sections, 2 figures.

Figures (2)

  • Figure 1: The ARGUE framework from Mayfield et al. (2024), adapted with permission.
  • Figure 2: Auto-ARGUE vs. human agreement on system rankings based on sentence precision (left) and nugget recall (right) for the TREC 2024 NeuCLIR RG pilot task.