A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks
Sara Rosenthal, Maeda Hanafi, Yannis Katsis, Lucian Popa, Marina Danilevsky
TL;DR
This study examines how different annotator feedback loops affect the quality of complex multi-turn Retrieval-Augmented Generation (RAG) data. It presents a longitudinal comparison of internal annotators (direct feedback) and external annotators (indirect feedback), who used the RAGAPHENE tool to create and review RAG conversations across Pilot, Creation, and Review phases. The key finding is that a closer feedback loop improves data quality but reduces throughput and diversity, whereas the external group offers higher volume with lower depth. This suggests a hybrid workflow: external annotators for volume and diversity, internal annotators for quality assurance and guideline refinement. The work provides practical guidance on task design, tutorials, and tooling for complex annotation tasks in RAG data creation and evaluation.
Abstract
Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by Large Language Models (LLMs), helping to ensure that they are faithful and do not provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loops of an internal and an external human annotator group for the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and provide results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops: a closer loop creates higher-quality conversations at the cost of lower quantity and diversity. Further, we present guidance on how best to utilize two different population groups when performing annotation tasks, particularly when the task is complex.
