SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics

Zhiwen You, Kanyao Han, Haotian Zhu, Bertram Ludäscher, Jana Diesner

TL;DR

SciPrompt addresses the challenge of fine-grained scientific topic classification under few- and zero-shot conditions by automatically retrieving domain-specific terms from external knowledge bases and weighting them in a verbalizer. It extends verbalization from tokens to phrases via knowledge retrieval and semantic filtering, guided by SciNLI-based domain adaptation, and employs a weighted, vector-based verbalizer to map MLM predictions to scientific topics. Evaluated on SDPRA 2021, arXiv, S2ORC, and Emerging NLP, SciPrompt achieves state-of-the-art performance in low-resource settings and demonstrates efficiency gains with its Soft variant. The approach offers scalable, knowledge-grounded scientific text classification for rapidly emerging scientific topics, with practical implications for cross-domain literature organization and search.

Abstract

Prompt-based fine-tuning has become an essential method for eliciting information encoded in pre-trained language models for a variety of tasks, including text classification. For multi-class classification tasks, prompt-based fine-tuning under low-resource scenarios has achieved performance comparable to that of full fine-tuning methods. Previous studies have used crafted prompt templates and verbalizers, which map the label term space to the class space, to solve the classification problem as a masked language modeling task. However, cross-domain and fine-grained prompt-based fine-tuning with an automatically enriched verbalizer remains unexplored, mainly due to the difficulty and cost of manually selecting domain label terms for the verbalizer, which requires human domain expertise. To address this challenge, we introduce SciPrompt, a framework designed to automatically retrieve scientific topic-related terms for low-resource text classification tasks. To this end, we select semantically correlated and domain-specific label terms within the context of scientific literature for verbalizer augmentation. Furthermore, we propose a new verbalization strategy that uses correlation scores as additional weights to enhance the prediction performance of the language model during model tuning. Our method outperforms state-of-the-art prompt-based fine-tuning methods on scientific text classification tasks under few- and zero-shot settings, especially in classifying fine-grained and emerging scientific topics.
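To make the idea of a weighted verbalizer concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each class has a list of label terms with correlation scores, that the masked language model has already produced one logit per term at the mask position, and that class scores are a weight-normalized sum of term logits. The function names, the toy terms, the weights, and the logits are all hypothetical; the exact scoring formula used by SciPrompt is not given in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_verbalizer(mask_logits, verbalizer):
    """Map per-term logits at the mask position to class probabilities.

    mask_logits: dict of label term -> logit (assumed precomputed by an MLM).
    verbalizer:  dict of class label -> list of (label term, weight) pairs,
                 where the weight stands in for a correlation score.
    """
    class_scores = {}
    for label, terms in verbalizer.items():
        weight_sum = sum(weight for _, weight in terms)
        # Assumption: a class score is the weighted average of its term logits.
        weighted = sum(weight * mask_logits[term] for term, weight in terms)
        class_scores[label] = weighted / weight_sum
    labels = list(class_scores)
    probs = softmax([class_scores[label] for label in labels])
    return dict(zip(labels, probs))


# Hypothetical toy data: two classes with made-up terms, weights, and logits.
verbalizer = {
    "CR": [("cryptography", 0.9), ("encryption", 0.8)],
    "SE": [("software", 0.9), ("debugging", 0.7)],
}
mask_logits = {
    "cryptography": 2.0,
    "encryption": 1.5,
    "software": 0.2,
    "debugging": -0.5,
}

probs = weighted_verbalizer(mask_logits, verbalizer)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

The sketch treats every label term as having a single logit. Multi-token phrases, which SciPrompt also uses as label terms, would need their token-level scores aggregated first; that step is omitted here.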

Paper Structure (30 sections, 4 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Overall framework of SciPrompt. The left side shows the overall process of masked language modeling for performing the text classification task. The right side shows our proposed knowledge retrieval and domain-adaptive filtering phase (see the method section). The prediction results, such as CR and SE, correspond to the class labels for Cryptography and Software Engineering, respectively, and are used for scientific knowledge retrieval.
  • Figure 2: Performance comparison of few-shot methods over three datasets, as reported in the few-shot results table. We report the mean accuracy of each setting. Our method shows high stability in the accuracy distribution compared to the considered baseline models.
  • Figure 3: Model comparison on the Emerging NLP dataset under five-shot and zero-shot settings (see the Emerging NLP section).
  • Figure 4: Number of label terms across four datasets under three phases.
  • Figure 5: Box chart for all methods in the few-shot setting over three datasets.