Language Model as an Annotator: Unsupervised Context-aware Quality Phrase Generation

Zhihao Zhang, Yuan Zuo, Chenghua Lin, Junjie Wu

TL;DR

LMPhrase addresses the challenge of quality phrase mining without large gold-label datasets by combining a BERT-based Annotator using Perturbed Masking to produce context-aware silver labels with a BART-based Generator fine-tuned on those labels to generate phrases. The two components are merged to exploit their complementary strengths, achieving state-of-the-art results on sentence-level phrase tagging and document-level keyphrase extraction across two domain datasets. The approach demonstrates robustness to limited supervision, preserves informativeness and concordance, and is scalable to low-frequency and domain-specific phrases. This framework has practical implications for efficient, unsupervised phrase mining in diverse domains, with potential extensions to additional NLP tasks and self-supervised enhancements.

Abstract

Phrase mining is a fundamental text mining task that aims to identify quality phrases from context. However, large gold-label datasets are scarce because they demand substantial annotation effort from experts, which makes the task exceptionally challenging. The emerging, infrequent, and domain-specific nature of quality phrases adds further difficulty. In this paper, we propose LMPhrase, a novel unsupervised context-aware quality phrase mining framework built upon large pre-trained language models (LMs). Specifically, we first mine quality phrases as silver labels by applying a parameter-free probing technique called Perturbed Masking to the pre-trained language model BERT (which we call the Annotator). In contrast to typical statistic-based or distantly-supervised methods, our silver labels, derived from large pre-trained language models, take into account the rich contextual information contained in the LMs. As a result, they bring distinct advantages in preserving the informativeness, concordance, and completeness of quality phrases. Secondly, training a discriminative span prediction model relies heavily on massive annotated data and risks overfitting the silver labels. Instead, we formalize the phrase tagging task as a sequence generation problem by directly fine-tuning the sequence-to-sequence pre-trained language model BART on the silver labels (which we call the Generator). Finally, we merge the quality phrases from both the Annotator and the Generator as the final predictions, considering their complementary nature and distinct characteristics. Extensive experiments show that LMPhrase consistently outperforms all existing competitors across two phrase mining tasks of different granularity, where each task is tested on datasets from two different domains.
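The Annotator step can be illustrated with a minimal sketch of the Perturbed Masking idea as we understand it: the impact of token j on token i is taken as the distance between the representation of position i when only i is masked and when both i and j are masked. Everything below is an assumption made for illustration, not the authors' implementation: `toy_encoder` is a hypothetical, deterministic stand-in for BERT (each position only "sees" its immediate neighbours), Euclidean distance is one possible choice of distance, and the abstract does not say how phrases are read off the resulting matrix.

```python
import math
from typing import Callable, List, Sequence

MASK = "[MASK]"
Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def impact_matrix(tokens: Sequence[str], encode: Encoder) -> List[List[float]]:
    """matrix[i][j] = change in the representation of position i
    (already masked) when position j is masked as well."""
    n = len(tokens)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        masked_i = list(tokens)
        masked_i[i] = MASK
        base = encode(masked_i)[i]  # representation of i with only i masked
        for j in range(n):
            if i == j:
                continue
            masked_ij = list(masked_i)
            masked_ij[j] = MASK  # additionally mask j
            matrix[i][j] = euclidean(base, encode(masked_ij)[i])
    return matrix


def _feature(token: str) -> float:
    """Deterministic scalar feature of a token (hypothetical, toy only)."""
    return (sum(ord(ch) for ch in token) % 97) / 97.0


def toy_encoder(tokens: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for BERT: the vector at each position holds
    features of the token itself and of its left and right neighbours."""
    vectors = []
    for i, tok in enumerate(tokens):
        left = _feature(tokens[i - 1]) if i > 0 else 0.0
        right = _feature(tokens[i + 1]) if i < len(tokens) - 1 else 0.0
        vectors.append([_feature(tok), left, right])
    return vectors


tokens = ["sensor", "selection", "for", "monitoring"]
matrix = impact_matrix(tokens, toy_encoder)
```

With a real contextual encoder in place of `toy_encoder`, high mutual impact between adjacent tokens would suggest that they belong to the same phrase; in this toy setting only immediate neighbours can have non-zero impact.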

Paper Structure

This paper contains 30 sections, 2 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The architecture of LMPhrase: An unsupervised context-aware quality phrase mining framework with pre-trained language model.
  • Figure 2: Illustration of token perturbation.
  • Figure 3: Heatmap visualization for the sentence "sensor selection for energy-efficient ambulatory medical monitoring."
  • Figure 4: Evaluation results of sentence-level phrase tagging on two benchmark datasets with varying numbers of silver labels.
  • Figure 5: Qualitative Analysis.