A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks

Sara Rosenthal, Maeda Hanafi, Yannis Katsis, Lucian Popa, Marina Danilevsky

TL;DR

This study examines how different annotator feedback loops affect the quality of complex multi-turn Retrieval-Augmented Generation (RAG) data. It runs a longitudinal comparison between internal annotators (direct feedback) and external annotators (indirect feedback), who use the RAGAPHENE tool to create and review RAG conversations across Pilot, Creation, and Review phases. Key findings show that a closer feedback loop improves data quality and grounding diversity but reduces throughput, whereas the external group offers higher volume with lower depth. This suggests a hybrid workflow: external annotators for volume and diversity, internal annotators for quality assurance and guideline refinement. The work provides practical guidance on task design, tutorials, and tooling for complex annotation tasks in RAG data creation and evaluation.

Abstract

Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by Large Language Models (LLMs) to ensure they are faithful and don't provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loop of an internal and external human annotator group for the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and provide results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops; a closer loop creates higher quality conversations with a decrease in quantity and diversity. Further, we present guidance for how to best utilize two different population groups when performing annotation tasks, particularly when the task is complex.

Paper Structure

This paper contains 30 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: A sample of what the structure of a multi-turn RAG conversation looks like. This conversation has two turns (a question from the human followed by an answer from the AI conversational agent). The first question is unanswerable and has no relevant passages. The second question is answerable and has three relevant passages that were found from two different queries (Q1 and Q2). Both answers were edited from the original answer by adding (green) and/or removing (red) text.
  • Figure 2: View of creating a conversation in RAGAPHENE, when an annotator creates an initial message, "I am looking for a good program to study financial planning that does not require test scores to enroll".
  • Figure 3: Checklist that an annotator has to go through before exporting the created conversation.
  • Figure 4: Examples of annotator and automated comments in review mode.
  • Figure 5: Comparison of an original turn and a more complex alternative. The original Turn 2 was not complex: it returned the same three passages and repeated much of the information in the answer to Turn 1. The new Turn 2 (1) is more ambiguous because it is a keyword, (2) requires a significant amount of editing, and (3) requires re-querying, which was used to find a new relevant passage as shown.
  • ...and 2 more figures