Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, Huajun Chen

TL;DR

This work tackles the data-efficiency gap in NLP by enabling small pre-trained language models to excel as few-shot learners without manual prompt engineering. It introduces DifferentiAble pRompT (DART), a framework that differentiably optimizes internal prompt templates and label representations using unused tokens, jointly trained with a fluency constraint. Across 15 NLP tasks, DART outperforms standard fine-tuning and competes with or surpasses LM-BFF, with especially large gains on relation- and event-extraction tasks in low-data regimes. The approach is model-agnostic, parameter-efficient, and extends to other architectures like GPT-2, making few-shot learning more practical for real-world applications.

Abstract

Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners. However, their effectiveness depends mainly on scaling the model parameters and on prompt design, which hinders their use in most real-world applications. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. The main principle behind this approach is to reformulate potential natural language processing tasks as the task of a pre-trained language model and to differentiably optimize the prompt template as well as the target label with backpropagation. Furthermore, the proposed approach can be: (i) plugged into any pre-trained language model; (ii) extended to a wide range of classification tasks. A comprehensive evaluation on standard NLP tasks demonstrates that the proposed approach achieves better few-shot performance. Code is available at https://github.com/zjunlp/DART.
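The reformulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `[unusedN]` token names, the placement of the template tokens, and the logit values are all assumptions made for illustration. The sketch only shows how a classification input might be wrapped in pseudo template tokens with a `[MASK]` slot, and how class probabilities could be read from the logits of pseudo label tokens at that slot. In DART the embeddings of these tokens would be trained by backpropagation, a step this sketch leaves out.

```python
import math

MASK = "[MASK]"

def build_prompt(sentence, num_template_tokens=3):
    """Wrap a sentence in pseudo template tokens followed by a mask slot.

    The token names and their position after the sentence are assumptions.
    """
    template = [f"[unused{i}]" for i in range(1, num_template_tokens + 1)]
    return ["[CLS]"] + sentence.split() + template + [MASK, "[SEP]"]

def make_label_map(classes, first_unused=100):
    """Assign one pseudo label token per class (offset chosen arbitrarily)."""
    return {c: f"[unused{first_unused + i}]" for i, c in enumerate(classes)}

def class_probs(mask_logits, label_map):
    """Softmax over the label tokens' logits at the [MASK] position only."""
    scores = {c: mask_logits[tok] for c, tok in label_map.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

tokens = build_prompt("a gripping and funny film")
label_map = make_label_map(["negative", "positive"])

# Hypothetical logits a masked language model might emit at [MASK].
# Tokens outside the label map (here "the") are ignored by class_probs.
mask_logits = {"[unused100]": 0.2, "[unused101]": 1.7, "the": 3.0}
probs = class_probs(mask_logits, label_map)
prediction = max(probs, key=probs.get)
```

Restricting the softmax to the label tokens is one simple way to turn a masked-token prediction into a class decision. Because both the template and the label tokens are ordinary vocabulary entries, their embeddings can receive gradients like any other model parameter.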

Paper Structure

This paper contains 28 sections, 11 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The architecture of the DifferentiAble pRompT (DART) model compared with MLM pre-training and conventional fine-tuning, where $T_i$ and $Y_i$ are unused or special tokens in the vocabulary. We leverage a few parameters within the language model as the template and label tokens and optimize them via backpropagation without introducing additional parameters apart from the model.
  • Figure 2: (a) Few-shot results on ACE-2005. We used K = 4, 8, 16, and 32 (# examples per class) with BERT. (FT = fine-tuning) (b) BERT-large vs. GPT-2-medium results on SemEval. For lower K, our method consistently outperforms conventional fine-tuning.
  • Figure 3: Visualization of masked tokens' representations at different training steps (after 10, 30, 50, and 70 steps, from left to right) with fixed prompts.
  • Figure 4: Visualization of masked tokens' representations at different training steps (after 10, 30, 50, and 70 steps, from left to right) with differentiable prompts.
  • Figure 5: A 3D visualization of several label representations learned in DART.
  • ...and 1 more figure