ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented Argumentation with LLM Judges

Kaustubh D. Dhole, Kai Shu, Eugene Agichtein

TL;DR

This work defines Retrieval Augmented Argumentation (RAArg) and introduces a two-stage reference implementation using BM25 + LLM reranking and few-shot generation to create long, evidence-grounded arguments conditioned on stance. It then presents ConQRet, a benchmark built from ProCon.org with ground-truth sources to enable comprehensive evaluation of retrieval, grounding, and argument quality on real-world, lengthy documents. The core contribution is a suite of fine-grained LLM Judges that assess context relevance, groundedness, and argument quality, validated on ArgQuality and ConQRet datasets, showing stronger alignment with human judgments than previous single-score evaluations. The results demonstrate the feasibility and value of multi-dimensional, interpretable evaluation for retrieval-augmented generation tasks and propose this framework as a foundation for extending to other complex information-grounded generation problems.
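The two-stage retrieval described above (a BM25 first pass followed by reranking) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenizer, the `k1` and `b` defaults, and the cut-off sizes are all hypothetical, and `overlap_reranker` is a simple stand-in for the LLM reranker, which in practice would prompt a model to score each candidate.

```python
import math
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with the standard BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    rerank_fn: Callable[[str, str], float],
    first_stage_k: int = 3,
    final_k: int = 2,
) -> List[str]:
    """Stage 1: keep the top BM25 candidates. Stage 2: reorder them with rerank_fn."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:first_stage_k]
    reranked = sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:final_k]]


def overlap_reranker(query: str, doc: str) -> float:
    # Hypothetical placeholder for an LLM relevance score:
    # the fraction of query terms that appear in the document.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q) if q else 0.0
```

Keeping the reranker as a pluggable callable means the cheap lexical first stage narrows the pool before any expensive model call is made; the documents that survive would then be passed, together with the topic and stance, to the few-shot generation step.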

Abstract

Computational argumentation, which involves generating answers or summaries for controversial topics like abortion bans and vaccination, has become increasingly important in today's polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), leveraging real-world evidence for high-quality, grounded arguments. However, evaluating RAArg remains challenging, as human evaluation is costly and difficult for complex, lengthy answers on complicated topics. At the same time, re-using existing argumentation datasets is no longer sufficient, as they lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. To address these gaps, we investigate automated evaluation methods using multiple fine-grained LLM judges, providing better and more interpretable assessments than traditional single-score metrics and even previously reported human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, a new benchmark featuring long and complex human-authored arguments on debated topics, grounded in real-world websites, allowing an exhaustive evaluation across retrieval effectiveness, argument quality, and groundedness. We validate our LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed LLM Judges and the ConQRet benchmark can enable rapid progress in computational argumentation and can be naturally extended to other complex retrieval-augmented generation tasks.

Paper Structure

This paper contains 40 sections, 13 figures, 18 tables.

Figures (13)

  • Figure 1: Retrieval Augmented Argumentation (RAArg) and LLM Judges used for evaluation of RAArg.
  • Figure 2: Evidence document length distribution. Most documents contain between 100 and roughly 10k tokens.
  • Figure 3: Hallucinated sentences (in purple) inserted into the argument by converting grounded sentences into sentences that contradict the grounded evidence.
  • Figure 4: Metric consistency analysis for gemini-1.5-flash (left) and GPT-4o (right), comparing how increasing irrelevant content and increasing hallucinations each affect the three metrics.
  • Figure 5: GPT-4o-mini RAG example output (top) and its LLM-Judge evaluation (bottom).
  • ...and 8 more figures