Minimizing PLM-Based Few-Shot Intent Detectors
Haode Zhang, Albert Y. S. Lam, Xiao-Ming Wu
TL;DR
This work tackles the challenge of deploying PLM-based few-shot intent detectors in resource-constrained settings by combining LLM-based data augmentation, CoFi-based Transformer compression, and a novel V-Prune vocabulary pruning mechanism with PCA-driven embedding reduction. The approach augments scarce labeled data with off-the-shelf LLMs, distills a small student from a large teacher via CoFi, and constructs a task-specific, drastically smaller vocabulary while compensating for missing tokens through nearest-neighbor mapping. Across four real-world benchmarks in a 5-shot regime, the method achieves about a 21x decrease in memory usage (including both Transformer and vocabulary) with almost no loss in accuracy, demonstrating practical deployability on devices with limited resources. The results highlight the importance of task-specific vocabulary design and data augmentation in few-shot PLM compression, offering a scalable path toward efficient on-device intent detection.
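The nearest-neighbor compensation for pruned tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embeddings, the choice of cosine similarity, and the helper names `cosine` and `prune_vocabulary` are all assumptions made for illustration, and the PCA-based embedding reduction is not covered here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prune_vocabulary(embeddings, kept_tokens):
    """Keep only task-specific tokens and map every dropped token to its
    most similar kept token (hypothetical helper, not from the paper)."""
    pruned = {tok: embeddings[tok] for tok in kept_tokens}
    fallback = {}
    for tok, vec in embeddings.items():
        if tok in pruned:
            continue
        # Nearest kept neighbor in embedding space stands in for the dropped token.
        fallback[tok] = max(pruned, key=lambda kept: cosine(vec, pruned[kept]))
    return pruned, fallback

# Toy 3-dimensional embeddings; real PLM embeddings have hundreds of dimensions.
full_embeddings = {
    "transfer": [0.9, 0.1, 0.0],
    "send":     [0.8, 0.2, 0.1],
    "balance":  [0.1, 0.9, 0.0],
    "funds":    [0.2, 0.8, 0.1],
    "weather":  [0.0, 0.1, 0.9],
}
task_tokens = ["transfer", "balance"]  # tokens assumed to occur in the task data

pruned, fallback = prune_vocabulary(full_embeddings, task_tokens)
print(sorted(pruned))  # tokens that keep their own embedding row
print(fallback)        # dropped token -> kept token used in its place
```

In this sketch only the kept tokens retain an embedding row, so the embedding table shrinks with the vocabulary, while the fallback table lets unseen tokens at inference time reuse the row of a similar kept token instead of collapsing to a single unknown-token entry.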
Abstract
Recent research has demonstrated the feasibility of training efficient intent detectors based on pre-trained language models (PLMs) with limited labeled data. However, deploying these detectors in resource-constrained environments such as mobile devices is challenging because of their large size. In this work, we address this issue by exploring techniques to minimize the size of PLM-based intent detectors trained with few-shot data. Specifically, we utilize large language models (LLMs) for data augmentation, employ a cutting-edge model compression method for knowledge distillation, and devise a vocabulary pruning mechanism called V-Prune. With these approaches, we achieve a compression ratio of 21 in model memory usage, including both the Transformer and the vocabulary, while maintaining almost identical performance on four real-world benchmarks.
