MetaKP: On-Demand Keyphrase Generation

Di Wu, Xiaoxian Shen, Kai-Wei Chang

TL;DR

The paper defines on-demand keyphrase generation to meet diverse user intents and introduces MetaKP, a large-scale, multi-domain benchmark with 7,500 documents and 3,760 open-vocabulary goals across news and biomedicine. It develops two modeling paradigms—supervised multi-task fine-tuning and unsupervised prompting with self-consistency—and demonstrates that self-consistency prompting with GPT-4o achieves 0.548 SemF1, surpassing a fully fine-tuned BART-base in many settings and proving more robust under distribution shift. MetaKP enables two evaluation tasks: goal relevance assessment using Abstain F1 and goal-oriented keyphrase generation using SemF1 and SR, revealing strong abstention and generation performance for LLMs while exposing limitations of supervised fine-tuning under domain shifts. The work further showcases the framework’s practical potential for epidemic event detection and outlines broad applications, emphasizing the need for scalable, goal-driven NLP infrastructure and future expansion to multilingual domains and flexible instruction designs.

Abstract

Traditional keyphrase prediction methods predict a single set of keyphrases per document, failing to cater to the diverse needs of users and downstream applications. To bridge the gap, we introduce on-demand keyphrase generation, a novel paradigm that requires keyphrases that conform to specific high-level goals or intents. For this task, we present MetaKP, a large-scale benchmark comprising four datasets, 7500 documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases. Leveraging MetaKP, we design both supervised and unsupervised methods, including a multi-task fine-tuning approach and a self-consistency prompting method with large language models. The results highlight the challenges of supervised fine-tuning, whose performance is not robust to distribution shifts. By contrast, the proposed self-consistency prompting approach greatly improves the performance of large language models, enabling GPT-4o to achieve 0.548 SemF1, surpassing the performance of a fully fine-tuned BART-base model. Finally, we demonstrate the potential of our method to serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.
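
The abstract names a self-consistency prompting method but does not spell out its steps here. The following is a minimal sketch of how self-consistency could be applied to this task, assuming the model is sampled several times per (document, goal) pair, that a majority vote decides whether to abstain, and that keyphrases are kept when enough samples agree on them. The function name, the `sample_fn` callback, and the `n_samples` and `min_votes` parameters are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical sampler type: returns a list of keyphrases for (document, goal),
# or None when the model judges the goal irrelevant to the document.
Sampler = Callable[[str, str], Optional[List[str]]]


def self_consistent_keyphrases(
    sample_fn: Sampler,
    document: str,
    goal: str,
    n_samples: int = 5,
    min_votes: int = 2,
) -> Optional[List[str]]:
    """Aggregate several sampled predictions (illustrative assumption only)."""
    samples = [sample_fn(document, goal) for _ in range(n_samples)]
    relevant = [s for s in samples if s is not None]

    # Majority vote on goal relevance: abstain unless more than half
    # of the samples produced keyphrases.
    if 2 * len(relevant) <= n_samples:
        return None

    # Count each normalized keyphrase at most once per sample.
    votes: Counter = Counter()
    for sample in relevant:
        votes.update({kp.strip().lower() for kp in sample})

    # Keep phrases with enough agreement; order by votes, then alphabetically.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [kp for kp, count in ranked if count >= min_votes]
```

In practice `sample_fn` would wrap a call to a language model with a nonzero sampling temperature; a stub that returns canned outputs is enough to exercise the aggregation logic.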

Paper Structure (54 sections, 1 equation, 14 figures, 6 tables)

Figures (14)

  • Figure 1: An illustration of on-demand keyphrase generation. Given diverse user goals, models are required to generate goal-conforming keyphrases or abstain.
  • Figure 2: The annotation pipeline for MetaKP. Starting from human-annotated keyphrases, GPT-4 is instructed to propose high-level goals and self-refine them. Finally, the goals are validated and filtered by humans.
  • Figure 3: A visualization of the goal distribution for the news domain (top) and the biomedical domain (bottom). MetaKP features both high-frequency goals and a diverse long-tail goal distribution.
  • Figure 4: A visualization of the inference process of the proposed sequence-to-sequence generation approach. Based on the document and the goal prefix, the model self-decides the relevance of the goal and selectively generates the keyphrases for relevant goals only.
  • Figure 5: Goal relevance judgment results for different types of models. Zero-shot prompted LLMs achieve high performance, though slightly below supervised models. GPT-4o does not improve over GPT-3.5-Turbo.
  • ...and 9 more figures