AttributionBench: How Hard is Automatic Attribution Evaluation?

Yifei Li, Xiang Yue, Zeyi Liao, Huan Sun

TL;DR

AttributionBench is a unified benchmark for automatic attribution evaluation, framed as binary classification. It aggregates seven datasets into balanced train/dev/test splits with both in-distribution and out-of-distribution test sets. Even with fine-tuning, state-of-the-art systems reach only about 80% macro-F1, which shows how hard it remains to verify that claims are supported by their cited evidence. A detailed error analysis attributes most failures to insensitivity to fine-grained information and to mismatches between the information available to the model and the information available to human annotators. The results suggest that domain-adapted fine-tuning (notably on NLI data) and benchmarking help, but simply using larger models is insufficient; future work should align evidence processing with human judgments and improve sensitivity to fine-grained details.

Abstract

Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating the answer's attribution, i.e., whether every claim within the generated responses is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for automatic attribution evaluation methods. To bridge the gap in the absence of standardized benchmarks for these methods, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation. A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information, and the discrepancy between the information the model has access to and that human annotators do.
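The headline number above is macro-F1 under a binary formulation. As a minimal sketch of how that metric is computed, the snippet below averages the per-class F1 of the two labels. The label encoding (1 for attributable, 0 for not attributable), the function names, and the toy predictions are illustrative assumptions, not taken from the paper's code.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[int], pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when the class never appears
    # in either the gold labels or the predictions.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Sequence[int], pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Hypothetical toy data: 1 = attributable, 0 = not attributable.
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1]
print(round(macro_f1(gold, pred), 3))  # 0.667
```

Because both classes are weighted equally, a model that always predicts one label scores poorly on a balanced test set, which is one reason macro-F1 is a natural fit for balanced binary splits.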

Paper Structure (43 sections, 5 figures, 8 tables)

This paper contains 43 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: An illustration of the attribution evaluation task and two typical error examples from AttributionBench generated by GPT-3.5 (w/ CoT). The references are usually extracted manually from webpages by human annotators based on what they consider useful. Left: fine-grained information insensitivity (i.e., the model disregards or overlooks nuanced details in either the claim or the references, or fails to perform the summarization or inference over the given references that humans naturally perform). Right: human-model accessible information mismatch (i.e., human annotators can see the whole webpage while the model is given only the extracted evidence, leading to different judgments).
  • Figure 2: The average macro-F1 score of GPT-3.5 on the 7 test sets with different input fields. Q, C, E, and R stand for question, claim, evidence, and response, respectively. Although extra input fields provide additional information, they do not improve performance and can even harm it.
  • Figure 3: The performance of GPT-3.5 (w/ CoT) with several different prompts. Prompt engineering brings only limited gain across the 7 test sets. Adjusting prompts yields little gain in overall performance (first subfigure) and mainly changes the ratio of FP and FN cases (second subfigure). "comp_und" stands for "comprehensive understanding".
  • Figure 4: The distribution of error types across the 7 test sets.
  • Figure 5: The distribution of error types within the major error category "fine-grained information insensitivity" across the 7 test sets.
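
The Figure 3 caption notes that changing the prompt mostly shifts the balance between false positives (FP) and false negatives (FN) rather than the overall score. A minimal sketch of how that balance can be measured follows; the label encoding (1 for attributable), the function name, and the two prediction lists are hypothetical and only illustrate the bookkeeping.

```python
from typing import Sequence, Tuple


def fp_fn_counts(gold: Sequence[int], pred: Sequence[int],
                 positive: int = 1) -> Tuple[int, int]:
    """Return (false positives, false negatives) for the positive label."""
    fp = sum(1 for g, p in zip(gold, pred) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, pred) if p != positive and g == positive)
    return fp, fn


# Hypothetical outputs from two prompts on the same six examples.
gold = [1, 1, 1, 0, 0, 0]
pred_lenient = [1, 1, 1, 1, 1, 0]  # over-predicts "attributable"
pred_strict = [0, 0, 1, 0, 0, 0]   # over-predicts "not attributable"

print(fp_fn_counts(gold, pred_lenient))  # (2, 0)
print(fp_fn_counts(gold, pred_strict))   # (0, 2)
```

Both hypothetical prompts make two errors, so their accuracy is identical, yet the error type flips entirely; this is the kind of shift the caption describes.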