Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study
Di Wu, Wasi Uddin Ahmad, Kai-Wei Chang
TL;DR
The paper addresses how pre-trained language models compare to non-PLM baselines for keyphrase extraction and generation, formulating extraction as sequence labeling and generation as seq2seq. It conducts a comprehensive empirical study across science, news, and online forums, evaluating encoder-decoder, encoder-only, and domain-specific PLMs under various design choices, including in-domain pre-training and parameter budgets. Key findings show that large or in-domain seq2seq PLMs approach state-of-the-art performance, while in-domain encoder-only PLMs offer strong, data-efficient results; deeper encoders with shallower decoders are generally preferred under fixed budgets, and task formulation (extraction vs generation) interacts with model choice. The work provides concrete guidance for practitioners and contributes pre-trained domain-specific models (SciBART, NewsBART, NewsBERT) to advance keyphrase generation in scientific and real-world settings. It further demonstrates that task-specific pre-training does not fully replace in-domain adaptation, underscoring the importance of domain-aware fine-tuning for reliable keyphrase generation.
Abstract
Neural models that do not rely on pre-training have excelled in the keyphrase generation task with large annotated datasets. Meanwhile, new approaches have incorporated pre-trained language models (PLMs) for their data efficiency. However, a systematic study of how the two types of approaches compare, and of how different design choices affect the performance of PLM-based models, is lacking. To fill this knowledge gap and facilitate a more informed use of PLMs for keyphrase extraction and keyphrase generation, we present an in-depth empirical study. Formulating keyphrase extraction as sequence labeling and keyphrase generation as sequence-to-sequence generation, we perform extensive experiments in three domains. After showing that PLMs have competitive high-resource performance and state-of-the-art low-resource performance, we investigate important design choices including in-domain PLMs, PLMs with different pre-training objectives, using PLMs under a parameter budget, and different formulations for present keyphrases. Further results show that (1) in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models; (2) with a fixed parameter budget, prioritizing model depth over width and allocating more layers to the encoder leads to better encoder-decoder models; and (3) by introducing four in-domain PLMs, we achieve competitive performance in the news domain and state-of-the-art performance in the scientific domain.
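To make the two task formulations named in the abstract concrete, the following is a minimal sketch of how one document might be turned into training targets for each. It is illustrative only and not the authors' preprocessing: the helper names (`find_spans`, `bio_labels`, `seq2seq_target`), the exact-match rule for deciding whether a keyphrase is present, the `" ; "` separator, and the present-then-absent ordering are all assumptions made for this example.

```python
from typing import List

Tokens = List[str]


def find_spans(doc: Tokens, phrase: Tokens) -> List[int]:
    """Return every start index where `phrase` occurs verbatim in `doc`."""
    n, m = len(doc), len(phrase)
    if m == 0:
        return []
    return [i for i in range(n - m + 1) if doc[i:i + m] == phrase]


def bio_labels(doc: Tokens, keyphrases: List[Tokens]) -> List[str]:
    """Sequence-labeling target: tag each token B, I, or O.

    Only keyphrases that appear in the document can be labeled, which is
    why this formulation covers extraction (present keyphrases) only.
    """
    labels = ["O"] * len(doc)
    for phrase in keyphrases:
        for start in find_spans(doc, phrase):
            end = start + len(phrase)
            # Assumption: skip a match that overlaps an earlier labeled span.
            if any(tag != "O" for tag in labels[start:end]):
                continue
            labels[start] = "B"
            for j in range(start + 1, end):
                labels[j] = "I"
    return labels


def seq2seq_target(doc: Tokens, keyphrases: List[Tokens], sep: str = " ; ") -> str:
    """Sequence-to-sequence target: one string holding all keyphrases.

    Assumption: present keyphrases come first, ordered by first occurrence,
    followed by absent keyphrases in their given order.
    """
    present = [kp for kp in keyphrases if find_spans(doc, kp)]
    absent = [kp for kp in keyphrases if not find_spans(doc, kp)]
    present.sort(key=lambda kp: find_spans(doc, kp)[0])
    return sep.join(" ".join(kp) for kp in present + absent)


# Hypothetical toy example, not taken from any dataset in the study.
doc = "we study pre-trained language models for keyphrase generation".split()
keyphrases = [
    ["keyphrase", "generation"],
    ["language", "models"],
    ["empirical", "study"],  # absent: never appears contiguously in doc
]

labels = bio_labels(doc, keyphrases)
target = seq2seq_target(doc, keyphrases)
print(list(zip(doc, labels)))
print(target)
```

In this toy case the labeling target can only mark "language models" and "keyphrase generation", while the generation target also contains the absent keyphrase "empirical study", which is the practical difference between the two formulations.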
