A Workbench for Autograding Retrieve/Generate Systems

Laura Dietz

TL;DR

This paper addresses the failings of traditional passage-level IR evaluation in the era of autoregressive LLMs, where outputs vary across runs and systems. It proposes the Autograding Workbench, a four-phase framework that uses LLM-powered graders to assess nugget coverage, exam-question answerability, or direct grading, all while enabling human oversight to ensure trustworthiness. The workbench provides a data model, prompts, and evaluation workflows that produce trec_eval-compatible qrels and a novel Autograde Cover metric, enabling reusable test banks and comparisons across methods. A TREC DL 2020 walk-through demonstrates practical applicability, showing high correlation with official leaderboards and highlighting the framework’s potential to accelerate robust, reusable IR evaluation for retrieval-augmented generation systems. The resource aims to foster reproducible research and flexible evaluation design through open-source tooling and clear integration with established evaluation pipelines.

Abstract

This resource paper addresses the challenge of evaluating Information Retrieval (IR) systems in the era of autoregressive Large Language Models (LLMs). Traditional methods relying on passage-level judgments are no longer effective due to the diversity of responses generated by LLM-based systems. We provide a workbench to explore several alternative evaluation approaches to judge the relevance of a system's response that incorporate LLMs: 1. Asking an LLM whether the response is relevant; 2. Asking the LLM which set of nuggets (i.e., relevant key facts) is covered in the response; 3. Asking the LLM to answer a set of exam questions with the response. This workbench aims to facilitate the development of new, reusable test collections. Researchers can manually refine sets of nuggets and exam questions, observing their impact on system evaluation and leaderboard rankings. Resource available at https://github.com/TREMA-UNH/autograding-workbench

Paper Structure (23 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Autograding Process: Phase 1: Test bank creation with semi-automatic methods. Phase 2: Automatic grading with prompt-based LLMs. Phase 3: Manual verification and oversight. Phase 4: Evaluation via trec_eval or Autograde Cover. Results from any phase are used by the human-in-the-loop to refine the test bank and adjust prompts to ensure that the automatic grading agrees with the human understanding of relevance.
  • Figure 2: Example question generated for TREC DL 2020.
  • Figure 3: Example nugget generated for TREC DL 2020.
  • Figure 4: Data Model. Query, passage text, and ID must be provided. If available, manual judgment level and system information can be used for analysis and verification in Phases 3 and 4. Phase 2 adds the fields exam_grades and/or grades with information about correct questions/nuggets, self-ratings of answerability, and answers for manual verification. All phases support filtering based on the fields llm and prompt_class.
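The data model described for Figure 4 can be pictured with a minimal Python sketch. Only the field names exam_grades, grades, llm, and prompt_class come from the caption. Every other name here (Grade, Passage, correct_ids, self_rating, select_grades, to_qrels_line) is hypothetical, and the rule that turns grades into a relevance label is an illustrative assumption, not the workbench's actual method. The output line follows the standard four-column qrels layout read by trec_eval (query ID, iteration, document ID, relevance).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Grade:
    """One grading result attached to a passage in Phase 2 (hypothetical layout)."""
    llm: str            # grading model; field name taken from the Figure 4 caption
    prompt_class: str   # grading prompt; field name taken from the Figure 4 caption
    correct_ids: List[str] = field(default_factory=list)  # assumed: IDs of correct questions/nuggets
    self_rating: Optional[int] = None                     # assumed: self-rating of answerability


@dataclass
class Passage:
    """A graded passage: query ID, passage ID, and text are required."""
    query_id: str
    passage_id: str
    text: str
    judgment: Optional[int] = None   # manual judgment level, if available
    system: Optional[str] = None     # system information, if available
    exam_grades: List[Grade] = field(default_factory=list)
    grades: List[Grade] = field(default_factory=list)


def select_grades(passage: Passage,
                  llm: Optional[str] = None,
                  prompt_class: Optional[str] = None) -> List[Grade]:
    """Keep only the grades matching the given llm and/or prompt_class."""
    return [
        g for g in passage.exam_grades + passage.grades
        if (llm is None or g.llm == llm)
        and (prompt_class is None or g.prompt_class == prompt_class)
    ]


def to_qrels_line(passage: Passage, prompt_class: str) -> str:
    """Format one qrels line: 'query_id 0 passage_id relevance'.

    Assumption for illustration only: relevance is the number of distinct
    test-bank items graded correct under the chosen prompt class.
    """
    correct = set()
    for g in select_grades(passage, prompt_class=prompt_class):
        correct.update(g.correct_ids)
    return f"{passage.query_id} 0 {passage.passage_id} {len(correct)}"


if __name__ == "__main__":
    p = Passage(query_id="q1", passage_id="d1", text="example passage text")
    p.exam_grades.append(Grade(llm="model-a", prompt_class="QuestionPrompt",
                               correct_ids=["x1", "x2"]))
    p.grades.append(Grade(llm="model-b", prompt_class="NuggetPrompt",
                          correct_ids=["n1"]))
    print(to_qrels_line(p, prompt_class="QuestionPrompt"))
```

Filtering by llm and prompt_class before export lets grades from several models and prompts sit side by side on one passage, so each combination can be evaluated separately.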