Aligning Large Language Model Behavior with Human Citation Preferences

Kenichiro Ando, Tatsuya Harada

TL;DR

The paper investigates how large language models (LLMs) decide which content should be cited and how well this behavior aligns with human preferences. It builds a dataset of 6,000 Wikipedia sentences labeled with eight content categories and evaluates pairwise citation preferences across all 28 category pairs using 11 LLMs. The evaluation reveals systematic misalignments: models overcite content marked 'Citation needed' and undercite sentences containing numbers or personal names. The paper further demonstrates that Direct Preference Optimization (DPO) can calibrate model behavior to better match human preferences, with an average gain of about $5.76\%$ across the evaluated models and larger improvements for smaller models. The findings highlight the need to train and evaluate cite-worthiness explicitly, advocate category-aware citation routing, and lay a foundation for future investigations of LLM citation behavior and verifiability across domains and languages.
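
The figure of 28 follows from choosing 2 of the 8 categories without regard to order. A minimal sketch of that enumeration is given below; the labels `category_1` through `category_8` and the helper `category_pairs` are hypothetical placeholders, since the eight category names are not listed in this section.

```python
from itertools import combinations

# Placeholder labels (an assumption): the real category names are not given here.
categories = [f"category_{i}" for i in range(1, 9)]


def category_pairs(labels):
    """Return every unordered pair of distinct labels."""
    return list(combinations(labels, 2))


pairs = category_pairs(categories)
print(len(pairs))  # 8 choose 2 = 28
```

Each pair appears once, so a comparison of category A against category B is not repeated as B against A.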

Abstract

Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of which reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remain underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as $27\%$ more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (by $-22.6\%$ relative to humans) and sentences containing personal names (by $-20.1\%$), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.
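
For reference, the standard Direct Preference Optimization objective, as commonly stated in the DPO literature, is reproduced below. This is the general form, not a formula taken from this paper. Here $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\beta$ is a temperature-like hyperparameter, $\sigma$ is the logistic function, and $(x, y_w, y_l)$ is a prompt with a preferred and a dispreferred response. How the paper maps its sentence pairs onto $x$, $y_w$, and $y_l$ is not specified in this section, so any such mapping is an assumption.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Minimizing this loss raises the likelihood of the preferred response relative to the dispreferred one, measured against the reference model.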

Paper Structure (24 sections, 2 equations, 7 tables)