MetaKP: On-Demand Keyphrase Generation
Di Wu, Xiaoxian Shen, Kai-Wei Chang
TL;DR
The paper defines on-demand keyphrase generation, a task in which the predicted keyphrases must conform to a user-specified goal or intent, and introduces MetaKP, a large-scale, multi-domain benchmark with 7,500 documents and 3,760 open-vocabulary goals across news and biomedicine. It develops two modeling paradigms: supervised multi-task fine-tuning and unsupervised prompting with self-consistency. Self-consistency prompting with GPT-4o achieves 0.548 SemF1, surpassing a fully fine-tuned BART-base in many settings and proving more robust under distribution shift. MetaKP supports two evaluation tasks: goal relevance assessment, scored with Abstain F1, and goal-oriented keyphrase generation, scored with SemF1 and SR. The results show strong abstention and generation performance for LLMs and expose the limitations of supervised fine-tuning under domain shift. The work also demonstrates the framework’s practical potential for epidemic event detection, outlines broader applications, and argues for scalable, goal-driven NLP infrastructure, with multilingual domains and more flexible instruction designs as future directions.
Abstract
Traditional keyphrase prediction methods predict a single set of keyphrases per document, failing to cater to the diverse needs of users and downstream applications. To bridge the gap, we introduce on-demand keyphrase generation, a novel paradigm that requires keyphrases that conform to specific high-level goals or intents. For this task, we present MetaKP, a large-scale benchmark comprising four datasets, 7500 documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases. Leveraging MetaKP, we design both supervised and unsupervised methods, including a multi-task fine-tuning approach and a self-consistency prompting method with large language models. The results highlight the challenges of supervised fine-tuning, whose performance is not robust to distribution shifts. By contrast, the proposed self-consistency prompting approach greatly improves the performance of large language models, enabling GPT-4o to achieve 0.548 SemF1, surpassing the performance of a fully fine-tuned BART-base model. Finally, we demonstrate the potential of our method to serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.
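The self-consistency idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how sampled outputs are aggregated, so everything below is an assumption for illustration: the majority-vote rule, the abstention rule, and the names `normalize`, `self_consistent_keyphrases`, `sample_fn`, `n_samples`, and `min_votes` are hypothetical and are not taken from the paper. Here `sample_fn` stands in for one sampled LLM call that returns either a list of keyphrases for the given goal or `None` when the model judges the goal irrelevant to the document.

```python
from collections import Counter
from typing import Callable, List, Optional


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so near-identical phrases share a vote."""
    return " ".join(phrase.lower().split())


def self_consistent_keyphrases(
    sample_fn: Callable[[], Optional[List[str]]],
    n_samples: int = 5,
    min_votes: int = 3,
) -> Optional[List[str]]:
    """Aggregate several sampled predictions by voting (illustrative rule only).

    sample_fn is a hypothetical stand-in for one sampled model call. It returns
    a list of keyphrases, or None if the model abstains for this goal.
    """
    samples = [sample_fn() for _ in range(n_samples)]
    answered = [s for s in samples if s is not None]

    # Assumed abstention rule: abstain unless a strict majority of samples answered.
    if len(answered) * 2 <= n_samples:
        return None

    votes: Counter = Counter()
    for sample in answered:
        # Count each phrase at most once per sample.
        for phrase in {normalize(p) for p in sample}:
            votes[phrase] += 1

    # Keep phrases with enough votes, most frequent first, ties broken alphabetically.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, count in ranked if count >= min_votes]
```

In practice `sample_fn` would wrap a model call at a nonzero sampling temperature. Raising `min_votes` trades recall for precision, since only phrases that recur across samples survive.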
