Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training

Haode Zhang, Haowen Liang, Liming Zhan, Albert Y. S. Lam, Xiao-Ming Wu

TL;DR

The paper investigates whether continual pre-training is essential for few-shot intent classification and shows that direct fine-tuning (DFT) of PLMs on small labeled sets can achieve competitive results. It introduces DFT++, which combines two components: context augmentation, in which a generative PLM (GPT-J) produces contextually relevant unlabeled utterances, and sequential self-distillation, which exploits multi-view information. Neither component relies on external training corpora. Across BANKING77, HINT3, HWU64, and MCID-English, DFT++ (especially with RoBERTa) consistently matches or surpasses state-of-the-art continual pre-training baselines in 5-shot and 10-shot regimes, reducing dependence on large external datasets. The work also analyzes augmentation strategies, hyperparameters, and the complementary nature of continual pre-training and DFT++, and highlights practical implications for deploying task-oriented dialogue systems when data is scarce.

Abstract

We consider the task of few-shot intent detection, which involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data. The current approach to address this problem is through continual pre-training, i.e., fine-tuning pre-trained language models (PLMs) on external resources (e.g., conversational corpora, public intent detection datasets, or natural language understanding datasets) before using them as utterance encoders for training an intent classifier. In this paper, we show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected. Specifically, we find that directly fine-tuning PLMs on only a handful of labeled examples already yields decent results compared to methods that employ continual pre-training, and the performance gap diminishes rapidly as the amount of labeled data increases. To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance. Comprehensive experiments on real-world benchmarks show that given only two or more labeled samples per class, direct fine-tuning outperforms many strong baselines that utilize external data sources for continual pre-training. The code can be found at https://github.com/hdzhang-code/DFTPlus.
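
The abstract names sequential self-distillation without stating its objective. As a rough illustration of the general idea only, the sketch below combines a supervised cross-entropy term with a KL term that pulls a student's predictions toward those of a teacher from the previous training round. It is a generic knowledge-distillation loss, not the paper's exact formulation; the function names, the `temperature` value, and the `alpha` weighting are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Supervised loss for one labeled utterance."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(student_logits, teacher_logits, label,
                           temperature=2.0, alpha=0.5):
    """Weighted sum of the supervised term and the teacher-matching term.

    `temperature` and `alpha` are hypothetical hyperparameters chosen for
    illustration; they are not values taken from the paper.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    distill = kl_divergence(teacher_probs, student_probs)
    supervised = cross_entropy(student_logits, label)
    return alpha * supervised + (1.0 - alpha) * distill
```

In a sequential setup, the student trained in one round would serve as the teacher for the next round, so that `teacher_logits` always come from the most recent frozen model.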

Paper Structure (31 sections, 4 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Illustration of continual pre-training (orange) and direct fine-tuning (green).
  • Figure 2: Illustration of DFT++ with $2$ classes and $2$ labeled examples per class. GPT-J is employed to generate contextually relevant unlabeled utterances. Sequential self-distillation is performed to further boost the performance.
  • Figure 3: Training and test learning curves of DFT with BERT and RoBERTa as the text encoder, respectively.
  • Figure 4: Comparison between DFT (solid lines) and IsoIntentBERT (dashed lines). The benefit from continued pre-training (IsoIntentBERT) decays quickly.
  • Figure 5: An example of the prompt and generated utterances in a $5$-shot scenario. Green utterances are successful cases, while the red one is a failure case.
  • ...and 4 more figures