ALiiCE: Evaluating Positional Fine-grained Citation Generation

Yilong Xu; Jinhua Gao; Xiaoming Yu; Baolong Bi; Huawei Shen; Xueqi Cheng

TL;DR

ALiiCE tackles the problem of evaluating positional fine-grained inline citations in long-form QA by introducing a dependency-tree parsing pipeline to extract atomic claims and three targeted metrics: positional recall, positional precision, and the coefficient of variation of citation positions (CVCP). The framework enables assessment at the atomic-claim level, addressing limitations of sentence-level evaluation and improving alignment with user verifiability. Experiments on ASQA and ELI5 with GPT-3.5, GPT-4, and LLaMA-3 reveal that current LLMs generate few positional fine-grained citations and that open-source models are making meaningful progress. Human evaluation corroborates ALiiCE’s judgments and highlights a decoupling between citation quality and citation utility, suggesting future work on measuring utility and constructing reasoning paths for multi-hop retrieval. Overall, ALiiCE provides a principled, automatic benchmark for positional fine-grained citation generation and invites further exploration into more nuanced citation evaluation and generation strategies.

Abstract

Large Language Models (LLMs) can enhance credibility and verifiability by generating text with citations. However, existing tasks and evaluation methods are predominantly limited to sentence-level statements, neglecting the significance of positional fine-grained citations that can appear anywhere within sentences. To facilitate further exploration of fine-grained citation generation, we propose ALiiCE, the first automatic evaluation framework for this task. Our framework first parses each sentence claim into atomic claims via dependency analysis and then calculates citation quality at the atomic claim level. ALiiCE introduces three novel metrics for positional fine-grained citation quality assessment: positional fine-grained citation recall, positional fine-grained citation precision, and the coefficient of variation of citation positions. We evaluate the positional fine-grained citation generation performance of several LLMs on two long-form QA datasets. Our experiments and analyses demonstrate the effectiveness and reasonableness of ALiiCE. The results also indicate that existing LLMs still struggle to provide positional fine-grained citations.
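To make the third metric concrete, here is a minimal sketch of a coefficient of variation over citation positions within one sentence. It is an illustration under stated assumptions, not the paper's exact formulation: the position of a citation is assumed to be the number of whitespace-separated words preceding it, adjacent markers such as "[1][2]" are assumed to share one position, and the population standard deviation is used. The function names `citation_positions` and `coefficient_of_variation` are hypothetical.

```python
import re
from statistics import mean, pstdev

# Assumption: a run of adjacent markers such as "[1][2]" is one citation position.
CITATION_RUN = re.compile(r"(?:\[\d+\]\s*)+")


def citation_positions(sentence: str) -> list[int]:
    """Return, for each citation run, the number of words that precede it."""
    positions = []
    for match in CITATION_RUN.finditer(sentence):
        # Drop earlier citation markers so that only real words are counted.
        preceding = CITATION_RUN.sub(" ", sentence[:match.start()])
        positions.append(len(preceding.split()))
    return positions


def coefficient_of_variation(positions: list[int]) -> float:
    """Standard deviation divided by mean; 0.0 when it is undefined or trivial."""
    if len(positions) < 2:
        return 0.0
    mu = mean(positions)
    if mu == 0:
        return 0.0
    return pstdev(positions) / mu


sentence = (
    "In the plane crash on Grey's Anatomy, the characters who die are "
    "Dr. Lexie Grey [1][2] and Dr. Mark Sloan [3][4][5]."
)
positions = citation_positions(sentence)
cv = coefficient_of_variation(positions)
print(positions, round(cv, 4))
```

Under these assumptions, a sentence whose citations all sit at the end gives a value of 0, while citations spread across the sentence give a larger value. How positions are normalised, and how values are aggregated across sentences, is left to the paper's own definition.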

Paper Structure (58 sections, 4 equations, 13 figures, 8 tables, 1 algorithm)

Introduction
Background & Task Definition
Citation Generation In Long-form QA
Task Definition
ALiiCE: Automatic LLMs' Positional Fine-grained Citation Evaluation
Dependency Tree
Parsing Pipeline
Metrics For Citation Quality
Positional Fine-grained Citation Recall
Positional Fine-grained Citation Precision
Coefficient of Variation of Citation Positions
Experimental Setup
Datasets
Implementation
Models
...and 43 more sections

Figures (13)

Figure 1: "Sentence-level" vs. "Any-level" in the task of citation text generation. The text underlined in grey corresponds to the claim in A1 cited by "[1][2][3]". The texts underlined in orange and blue correspond to the claims in A2 cited by "[1]" and "[2][3]", respectively.
Figure 2: An example of the ALiiCE evaluation framework on positional fine-grained citation generation. Given a query and related documents, the LLM generates a long-form answer. For sentence $i$ in the answer, the parsing pipeline involves constructing the dependency tree, identifying the LCA node to obtain the modified tree of each claim, and converting the modified trees into texts. Finally, we calculate the citation recall and precision for each claim.
Figure 3: Evaluation process of citation quality by ALCE and ALiiCE on two examples from ASQA. The answers are generated by GPT-3.5 (5-psg).
Figure 4: Comparison of citation recall and precision between ALCE and ALiiCE across three models using the 5-psg setting on ASQA. ALiiCE reports lower citation recall and higher citation precision than ALCE.
Figure 5: The dependency tree of the sentence "In the plane crash on Grey's Anatomy, the characters who die are Dr. Lexie Grey [1][2] and Dr. Mark Sloan [3][4][5].", from the response generated by GPT-3.5 (5-psg). The query is "Who dies in the plane crash on greys?" from ASQA. The modified tree of the claim corresponding to citation "[1][2]" is shown in Figure \ref{fig:tree_6_15}. The modified tree of the claim corresponding to citation "[3][4][5]" is shown in Figure \ref{fig:tree_6_19}.
...and 8 more figures
