Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data
Parth Patwa, Simone Filice, Zhiyu Chen, Giuseppe Castellucci, Oleg Rokhlenko, Shervin Malmasi
TL;DR
The paper tackles efficient text classification in very low-resource settings by combining Parameter-Efficient Fine-Tuning (PEFT) with synthetic data augmentation. It proposes a three-step generate-filter-train pipeline: a single LLM generates class-specific synthetic data, In-Context Learning (ICL) with the same LLM filters out label-inconsistent samples, and the LLM is then fine-tuned with LoRA on the real plus synthetic data. Experiments on SST2, AG News, and TREC with Vicuna-7b/13b show accuracy comparable to or better than ICL, with inference speedups of approximately 2× to 5×. The findings show that making better use of a few labeled examples, through self-generated synthetic data and careful data curation, can unlock competitive LLM-based classification without external datasets or additional models. Future work aims to enhance data diversity and mitigate biases.
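The generate-filter-train pipeline can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the function `generate_filter_train` and the three callables (`generate`, `icl_classify`, `train`) are hypothetical names, and the toy stand-ins below replace the actual LLM generation, ICL filtering, and LoRA fine-tuning steps.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def generate_filter_train(
    seed: List[Example],
    generate: Callable[[str, List[str]], List[str]],
    icl_classify: Callable[[str, List[Example]], str],
    train: Callable[[List[Example]], object],
):
    """Sketch of generate-filter-train; all three callables are hypothetical."""
    # Group the few real examples by class.
    by_label: Dict[str, List[str]] = {}
    for text, label in seed:
        by_label.setdefault(label, []).append(text)

    # Step 1: generate class-specific synthetic texts from the real examples.
    synthetic = [
        (text, label)
        for label, texts in by_label.items()
        for text in generate(label, texts)
    ]

    # Step 2: keep a sample only if the ICL prediction matches its intended label.
    kept = [(t, y) for t, y in synthetic if icl_classify(t, seed) == y]

    # Step 3: fine-tune (LoRA in the paper) on real plus filtered synthetic data.
    train_set = seed + kept
    return train(train_set), train_set


# Toy stand-ins (assumptions) so the sketch runs without an LLM.
def toy_generate(label: str, texts: List[str]) -> List[str]:
    consistent = {"positive": "a great film", "negative": "a terrible film"}
    off_label = {"positive": "a terrible mess", "negative": "a great surprise"}
    return [consistent[label], off_label[label]]


def toy_icl_classify(text: str, demos: List[Example]) -> str:
    return "positive" if "great" in text else "negative"


def toy_train(data: List[Example]) -> dict:
    return {"num_train_examples": len(data)}


seed = [("loved it, great", "positive"), ("hated it", "negative")]
model, train_set = generate_filter_train(
    seed, toy_generate, toy_icl_classify, toy_train
)
```

In this toy run, the filter drops the two off-label synthetic samples, so the training set holds the two real examples plus two synthetic ones. In a real setup, each callable would wrap calls to the same LLM.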
Abstract
Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks. In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but at a cost in efficiency, due to the longer input prompt. In this paper, we propose a strategy to make LLMs as efficient as 0-shot text classifiers, while getting comparable or better accuracy than ICL. Our solution targets the low-resource setting, i.e., when only 4 examples per class are available. Using a single LLM and few-shot real data, we perform a sequence of generation, filtering, and Parameter-Efficient Fine-Tuning steps to create a robust and efficient classifier. Experimental results show that our approach leads to competitive results on multiple text classification datasets.
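To make the efficiency gap concrete, the sketch below builds a 0-shot prompt and an ICL prompt with 4 demonstrations per class and compares their lengths. The prompt template, the `build_prompt` helper, and the placeholder demonstrations are assumptions for illustration, not the prompts used in the paper, and whitespace word counts are only a rough proxy for the model's tokenizer.

```python
from typing import List, Sequence, Tuple


def build_prompt(
    text: str,
    labels: List[str],
    demos: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a classification prompt; the template is illustrative only."""
    parts = [f"Classify the text into one of: {', '.join(labels)}."]
    # ICL demonstrations; this loop adds nothing in the 0-shot case.
    for demo_text, demo_label in demos:
        parts.append(f"Text: {demo_text}\nLabel: {demo_label}")
    # The query itself, left open for the model to complete.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


labels = ["positive", "negative"]
# 4 placeholder demonstrations per class, mirroring the low-resource setting.
demos = [(f"example review {i}", label) for label in labels for i in range(4)]

zero_shot = build_prompt("the plot drags badly", labels)
few_shot = build_prompt("the plot drags badly", labels, demos)

# Rough length comparison; the ICL prompt grows with every demonstration.
zero_len = len(zero_shot.split())
few_len = len(few_shot.split())
print(zero_len, few_len)
```

Every inference call in the ICL setting carries all demonstrations, whereas a fine-tuned classifier can be queried with the 0-shot prompt alone.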
