SelectLLM: Can LLMs Select Important Instructions to Annotate?
Ritik Sachin Parkar, Jaehyung Kim, Jong Inn Park, Dongyeop Kang
TL;DR
SelectLLM tackles the high cost of assembling large instruction-tuning datasets by using LLM-driven prompting to assess unlabeled instructions, paired with coreset-like diversification via equal-size $K$-means clustering. The method partitions the unlabeled pool into $K$ diverse subsets and asks an LLM to select $\tilde{N}=\lfloor N/K \rfloor$ instructions per subset, so selection does not rely on labeled data. Empirical results on Dolly and Cleaned Alpaca show that SelectLLM outperforms strong baselines, including Alpagasus, on ROUGE-L and cosine similarity, while also reducing selection cost (e.g., $2.82 vs. $23.76 for 3k samples). The approach generalizes across datasets and offers insight into how well different LLMs suit the selection task, though cost and scalability remain considerations for widespread deployment.
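The partitioning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `equal_size_clusters` and `per_cluster_quota` are hypothetical, instruction embeddings are assumed to be available as a NumPy array, and equal-size $K$-means is approximated here by a few Lloyd iterations followed by a greedy, capacity-limited assignment.

```python
import math
import numpy as np


def equal_size_clusters(X, k, n_iter=10, seed=0):
    """Heuristic balanced clustering (an assumption, not the paper's code).

    Runs plain Lloyd iterations to place centroids, then assigns points
    greedily so that no cluster holds more than ceil(n / k) points.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = X[rng.choice(n, size=k, replace=False)].astype(float)

    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        nearest = dist.argmin(axis=1)
        for c in range(k):
            members = X[nearest == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)

    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    capacity = math.ceil(n / k)
    labels = np.full(n, -1)
    counts = np.zeros(k, dtype=int)

    # Visit (point, cluster) pairs from closest to farthest and accept a
    # pair only if the point is unassigned and the cluster has room.
    for flat in np.argsort(dist, axis=None):
        i, c = divmod(int(flat), k)
        if labels[i] == -1 and counts[c] < capacity:
            labels[i] = c
            counts[c] += 1
    return labels


def per_cluster_quota(budget, k):
    """Number of instructions to select per cluster: floor(N / K)."""
    return budget // k
```

With an annotation budget $N$ and $K$ clusters, `per_cluster_quota(N, K)` gives the per-cluster count $\tilde{N}$ that the LLM is then asked to pick from each subset. The greedy assignment is only one simple way to enforce balanced cluster sizes.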
Abstract
Instruction tuning benefits from large and diverse datasets; however, creating such datasets involves a high cost of human labeling. While synthetic datasets generated by large language models (LLMs) have partly solved this issue, they often contain low-quality data. One effective solution is to selectively annotate unlabeled instructions, especially given the relative ease of acquiring unlabeled instructions or texts from various sources. However, how to select unlabeled instructions is not well explored, especially in the context of LLMs. Therefore, we introduce SelectLLM, an alternative framework that leverages the capabilities of LLMs to select unlabeled instructions more effectively. Specifically, SelectLLM consists of two key steps: coreset-based clustering of unlabeled instructions to increase diversity, and prompting an LLM to identify the most beneficial instructions within each cluster. We evaluate SelectLLM on AlpacaEval2 and MT-Bench, demonstrating its ability to outperform state-of-the-art methods like Alpagasus. In addition, we compare the performance and compatibility of SelectLLM with various LLMs, such as ChatGPT, LLaMA-3.1-70B, and Gemma-2-27b. SelectLLM's adaptability and robustness are further evidenced by its ability to maintain high performance across both human and synthetic datasets. All code and data are publicly available (https://github.com/minnesotanlp/select-llm).
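The second step, prompting an LLM to pick instructions within a cluster, might look roughly like the sketch below. The prompt wording, the function names, and the `llm` callable (any function that maps a prompt string to a response string) are illustrative assumptions, not the prompt or API used in the paper or its repository.

```python
import re
from typing import Callable, List


def build_selection_prompt(instructions: List[str], n_select: int) -> str:
    """Format a cluster as a numbered list and ask for the best indices.

    The wording is a hypothetical example, not the authors' prompt.
    """
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(instructions, 1))
    return (
        f"Below are {len(instructions)} unlabeled instructions.\n"
        f"{numbered}\n"
        f"Choose the {n_select} instructions that would be most useful to "
        "annotate for instruction tuning. Answer with their numbers only, "
        "separated by commas."
    )


def parse_selection(response: str, n_candidates: int, n_select: int) -> List[int]:
    """Extract valid, de-duplicated zero-based indices from the response."""
    picked: List[int] = []
    for token in re.findall(r"\d+", response):
        idx = int(token) - 1  # the prompt numbers instructions from 1
        if 0 <= idx < n_candidates and idx not in picked:
            picked.append(idx)
    return picked[:n_select]


def select_from_cluster(
    instructions: List[str], n_select: int, llm: Callable[[str], str]
) -> List[str]:
    """Ask the LLM to choose n_select instructions from one cluster."""
    prompt = build_selection_prompt(instructions, n_select)
    indices = parse_selection(llm(prompt), len(instructions), n_select)
    return [instructions[i] for i in indices]
```

Passing the model as a plain callable keeps the sketch independent of any particular provider, which matters here because the paper compares several LLMs as selectors. The parser ignores out-of-range and repeated numbers, so a malformed response yields fewer selections rather than an error.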
