ALiiCE: Evaluating Positional Fine-grained Citation Generation
Yilong Xu, Jinhua Gao, Xiaoming Yu, Baolong Bi, Huawei Shen, Xueqi Cheng
TL;DR
ALiiCE tackles the problem of evaluating positional fine-grained inline citations in long-form QA by introducing a dependency-tree parsing pipeline to extract atomic claims and three targeted metrics: positional recall, positional precision, and the coefficient of variation of citation positions (CVCP). The framework enables assessment at the atomic-claim level, addressing limitations of sentence-level evaluation and improving alignment with user verifiability. Experiments on ASQA and ELI5 with GPT-3.5, GPT-4, and LLaMA-3 reveal that current LLMs generate few positional fine-grained citations and that open-source models are making meaningful progress. Human evaluation corroborates ALiiCE’s judgments and highlights a decoupling between citation quality and citation utility, suggesting future work on measuring utility and constructing reasoning paths for multi-hop retrieval. Overall, ALiiCE provides a principled, automatic benchmark for positional fine-grained citation generation and invites further exploration into more nuanced citation evaluation and generation strategies.
Abstract
Large Language Models (LLMs) can enhance credibility and verifiability by generating text with citations. However, existing tasks and evaluation methods are predominantly limited to sentence-level statements, neglecting the significance of positional fine-grained citations that can appear anywhere within a sentence. To facilitate further exploration of fine-grained citation generation, we propose ALiiCE, the first automatic evaluation framework for this task. Our framework first parses each sentence-level claim into atomic claims via dependency analysis and then calculates citation quality at the atomic-claim level. ALiiCE introduces three novel metrics for assessing positional fine-grained citation quality: positional fine-grained citation recall, positional fine-grained citation precision, and the coefficient of variation of citation positions. We evaluate the positional fine-grained citation generation performance of several LLMs on two long-form QA datasets. Our experiments and analyses demonstrate the effectiveness and reasonableness of ALiiCE. The results also indicate that existing LLMs still struggle to provide positional fine-grained citations.
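
To make the third metric more concrete, the sketch below computes a coefficient of variation (standard deviation divided by mean) over the relative positions of citation markers in a sentence. It is only an illustration under stated assumptions, not the ALiiCE implementation: the abstract does not specify how positions are measured or normalized, so the bracketed `[n]` marker format, the character-offset normalization, the use of the population standard deviation, and the helper names `citation_positions` and `coefficient_of_variation` are all hypothetical choices made here.

```python
import re
import statistics

# Assumption: citations appear as bracketed numbers such as [1] or [12].
CITATION = re.compile(r"\[\d+\]")


def citation_positions(sentence: str) -> list[float]:
    """Return the relative position (0..1) of each citation marker.

    Positions are character offsets in the citation-stripped text,
    divided by the stripped length (an assumed normalization).
    """
    stripped_len = len(CITATION.sub("", sentence))
    if stripped_len == 0:
        return []
    positions = []
    removed = 0  # characters belonging to earlier markers
    for match in CITATION.finditer(sentence):
        positions.append((match.start() - removed) / stripped_len)
        removed += match.end() - match.start()
    return positions


def coefficient_of_variation(values: list[float]) -> float:
    """Population standard deviation divided by the mean.

    Returns 0.0 when fewer than two values exist or the mean is zero.
    """
    if len(values) < 2:
        return 0.0
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0
    return statistics.pstdev(values) / mean


# Citations spread through the sentence versus all placed at the end.
spread = citation_positions("abcd[1]efgh[2]")
at_end = citation_positions("abcd[1][2]")
print(coefficient_of_variation(spread), coefficient_of_variation(at_end))
```

Under these assumptions, a sentence whose citations all sit at the end yields a value of zero, while citations distributed across the sentence yield a larger value, which matches the intuition that the metric reflects how varied citation positions are.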
