QUILL: Quotation Generation Enhancement of Large Language Models

Jin Xiao, Bowei Zhang, Qianyu He, Jiaqing Liang, Feng Wei, Jinglei Chen, Zujie Liang, Deqing Yang, Yanghua Xiao

TL;DR

This work tackles quotation generation (QG) in large language models by identifying three pervasive issues: quotation hallucination, contextual misalignment, and limited novelty. It proposes QUILL, a framework that combines a five-criterion automatic evaluation system, a bilingual knowledge base of up to 32,022 quotes, and a quotation-specific reranking metric to improve retrieval-augmented generation for QG tasks. The main contributions are a holistic evaluation system, a curated bilingual quotation corpus, and a fine-grained reranking mechanism that correlates strongly with human preferences and improves performance across open- and closed-source models. The approach reduces quotation hallucination, strengthens the authenticity and credibility of inserted quotes, and releases its data and code publicly to support further research and practical deployment of QG systems.

Abstract

While large language models (LLMs) have become excellent writing assistants, they still struggle with quotation generation. This is because they either hallucinate when providing factual quotations or fail to provide quotes that exceed human expectations. To bridge the gap, we systematically study how to evaluate and improve LLMs' performance in quotation generation tasks. We first establish a holistic and automatic evaluation system for the quotation generation task, which consists of five criteria, each with a corresponding automatic metric. To improve LLMs' quotation generation abilities, we construct a bilingual knowledge base that is broad in scope and rich in dimensions, containing up to 32,022 quotes. Moreover, guided by our criteria, we further design a quotation-specific metric to rerank the quotations retrieved from the knowledge base. Extensive experiments show that our metrics strongly correlate with human preferences. Existing LLMs struggle to generate desired quotes, but our quotation knowledge base and reranking metric help narrow this gap. Our dataset and code are publicly available at https://github.com/GraceXiaoo/QUILL.

Paper Structure

This paper contains 40 sections, 13 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: An example of prevalent issues in Quotation Generation (QG) by LLMs. In QG tasks, LLMs often fabricate sentences, leading to quotation hallucination. Additionally, the generated quotes may not align with the context, resulting in contextual inconsistency and semantic incoherence. Finally, the sentences produced by LLMs tend to be overly common, resulting in a lack of novelty in quotations.
  • Figure 2: The framework for our Quotation Generation (QG) task research. We first establish an evaluation system with 5 evaluation criteria and automatic metrics, then build a quotation knowledge base covering multiple languages, topics and eras, and finally propose a quotation-specific reranking metric to rerank the quotations recalled in the RAG stage and improve the performance of QG tasks.
  • Figure 3: Details of the 7 common categories and 21 scenarios in the evaluation dataset.
  • Figure 4: Correlation between our automatic evaluation metrics and human ratings. To avoid overlapping points, random jitter sampled from $\mathcal{N}(0, 0.05^2)$ was added to the human ratings after fitting the regression.
  • Figure 5: The specific topic distribution of the English quotation corpus.
  • ...and 1 more figure
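
The jittering described in the Figure 4 caption can be sketched as follows. This is a minimal illustration rather than the authors' plotting code: the data values, the function names `fit_line` and `jitter`, and the choice to regress metric scores on human ratings are all assumptions made for the example. The only details taken from the caption are that the noise is drawn from a normal distribution with standard deviation 0.05 and that it is added after the regression is fitted, so the fitted line depends only on the raw ratings.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def jitter(values, sigma=0.05, seed=0):
    """Return a copy of values with N(0, sigma^2) noise added to each entry."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical data: discrete human ratings and automatic metric scores.
human_ratings = [1, 2, 2, 3, 4, 5]
metric_scores = [0.10, 0.30, 0.35, 0.60, 0.80, 0.90]

# Step 1: fit the regression on the raw, un-jittered ratings.
slope, intercept = fit_line(human_ratings, metric_scores)

# Step 2: jitter the ratings for display only, so identical ratings
# no longer overlap in the scatter plot.
plotted_ratings = jitter(human_ratings)
```

Because the jitter is applied to a copy used only for plotting, the regression coefficients are unaffected by the random noise.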