Minimizing PLM-Based Few-Shot Intent Detectors

Haode Zhang, Albert Y. S. Lam, Xiao-Ming Wu

TL;DR

This work tackles the challenge of deploying PLM-based few-shot intent detectors in resource-constrained settings by combining LLM-based data augmentation, CoFi-based Transformer compression, and a novel V-Prune vocabulary pruning mechanism with PCA-driven embedding reduction. The approach augments scarce labeled data with off-the-shelf LLMs, distills a small student from a large teacher via CoFi, and constructs a task-specific, drastically smaller vocabulary while compensating for missing tokens through nearest-neighbor mapping. Across four real-world benchmarks in a 5-shot regime, the method achieves about a 21x decrease in memory usage (including both Transformer and vocabulary) with almost no loss in accuracy, demonstrating practical deployability on devices with limited resources. The results highlight the importance of task-specific vocabulary design and data augmentation in few-shot PLM compression, offering a scalable path toward efficient on-device intent detection.
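The vocabulary step summarized above can be illustrated with a small sketch. It is not the authors' V-Prune implementation: the function names `prune_vocabulary` and `pca_reduce`, the use of cosine similarity as the nearest-neighbor criterion, and the toy random embedding matrix are all assumptions made for illustration. The sketch keeps only a chosen subset of token rows, maps every dropped token to its most similar retained token, and then reduces the embedding width with PCA.

```python
import numpy as np


def prune_vocabulary(embeddings, kept_ids):
    """Keep only the rows in kept_ids; map every other token to its nearest kept token.

    Hypothetical helper: similarity is cosine similarity in the original
    embedding space, which is an assumption, not a detail from the paper.
    Returns (pruned_embeddings, old_to_new) where old_to_new[i] is the row in
    pruned_embeddings that original token i should use.
    """
    kept_ids = np.asarray(sorted(set(kept_ids)))
    pruned = embeddings[kept_ids]
    # Normalise rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sims = unit @ unit[kept_ids].T          # shape: (vocab_size, n_kept)
    old_to_new = sims.argmax(axis=1)
    # Retained tokens always map to themselves.
    old_to_new[kept_ids] = np.arange(len(kept_ids))
    return pruned, old_to_new


def pca_reduce(embeddings, dim):
    """Project embeddings onto their top `dim` principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T


# Toy usage: a 10-token vocabulary with 8-dimensional embeddings (random data).
rng = np.random.default_rng(0)
full_embeddings = rng.normal(size=(10, 8))
task_token_ids = [1, 3, 5, 7]               # tokens assumed to occur in the task data

small_embeddings, token_map = prune_vocabulary(full_embeddings, task_token_ids)
reduced_embeddings = pca_reduce(small_embeddings, dim=2)
print(small_embeddings.shape, reduced_embeddings.shape, token_map)
```

In a real pipeline, the retained token set would come from tokenizing the few-shot and augmented utterances, and the student model's input layer would need to match the reduced embedding width. Both concerns are outside this sketch.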

Abstract

Recent research has demonstrated the feasibility of training efficient intent detectors based on pre-trained language models (PLMs) with limited labeled data. However, deploying these detectors in resource-constrained environments such as mobile devices is challenging because of their large size. In this work, we address this issue by exploring techniques to minimize the size of PLM-based intent detectors trained with few-shot data. Specifically, we utilize large language models (LLMs) for data augmentation, employ a cutting-edge model compression method for knowledge distillation, and devise a vocabulary pruning mechanism called V-Prune. With these approaches, we achieve a compression ratio of 21 in model memory usage, including both the Transformer and the vocabulary, while maintaining almost identical performance on four real-world benchmarks.

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The efficacy of our approach under the 5-shot scenario. Model memory usage is denoted by the stacked bars, while model performance by the lines.
  • Figure 2: Illustration of our method. Off-the-shelf generative language models generate new utterances from the few labeled examples. The generated data are combined with the original few-shot data to compress a large teacher model into a small student model and to extract a small vocabulary.
  • Figure 3: An example of the prompt and generated utterances under the 5-shot scenario.
  • Figure 4: Impact of hyper-parameters on the performance.