Systematic Task Exploration with LLMs: A Study in Citation Text Generation

Furkan Şahinuç; Ilia Kuznetsov; Yufang Hou; Iryna Gurevych

TL;DR

The paper tackles how to systematically study creative NLG tasks enabled by LLMs by proposing a three-component framework that combines systematic input manipulation, a rich reference dataset, and a broad measurement kit. It applies this framework to citation-text generation, revealing that input configurations and task instructions jointly shape outputs and that free-form intents outperform categorical ones when steering generation. By comparing two LLMs and conducting human evaluations, the work demonstrates the value of multi-metric assessment, including NLI-based measures, for capturing factual alignment beyond traditional ROUGE scores. The findings offer practical guidance for prompting creative NLG tasks and provide a public dataset and code to enable reproducibility and broader exploration beyond citation text generation.

Abstract

Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks. Yet, this flexibility brings new challenges, as it introduces new degrees of freedom in formulating the task inputs and instructions and in evaluating model performance. To facilitate the exploration of creative NLG tasks, we propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement. We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric and has not yet been tackled within the LLM paradigm. Our results highlight the importance of systematically investigating both task instruction and input configuration when prompting LLMs, and reveal non-trivial relationships between different evaluation metrics used for citation text generation. Additional human generation and human evaluation experiments provide new qualitative insights into the task to guide future research in citation text generation. We make our code and data publicly available.

Paper Structure (37 sections, 5 figures, 15 tables)

Introduction
Background
LLMs and Prompting
Citation Text Generation
NLG Evaluation
Task and Method
Prompt Manipulation
Reference Data
Measurements
Experiments
Results
Human Evaluation
Evaluation
Qualitative observations
Conclusion
...and 22 more sections

Figures (5)

Figure 1: Citation text generation with LLMs. The task (1) is to generate a paragraph of related work from the citing paper (A) about a cited paper (B). The instruction combined with task inputs constitutes a prompt (2) that is communicated to the model. The model's response (3) is evaluated using a range of measurements, from word count to NLI-based factuality metrics (4).
Figure 2: Prompt manipulation combines the instruction (top) with input components (left) and the corresponding data (bottom), including the free-form citation intent and example sentence. The result serves as the LLM prompt, as in Figure 1.
Figure 3: Conventional and NLI-based metric results. First two rows: ROUGE-L, BERTScore and SciBERTScore. Last two rows: BLEURT, TRUE and SummaC. Llama 2-Chat (13B) (above) and GPT 3.5 (below). Abstract + Intent (Free-form or Categorical) + Example, #Instruction color-coded.
Figure 4: Pearson correlation between instance-level measurements over all configurations. WC: word count, Sum: SummaC.
Figure 5: Citation count distribution on a logarithmic scale.
