SelectLLM: Can LLMs Select Important Instructions to Annotate?

Ritik Sachin Parkar, Jaehyung Kim, Jong Inn Park, Dongyeop Kang

TL;DR

SelectLLM tackles the high cost of assembling large instruction-tuning datasets by using LLM-driven prompting to assess unlabeled instructions, paired with coreset-like diversification via equal-size $K$-means clustering. The method partitions the unlabeled pool into $K$ diverse subsets and asks an LLM to select $\tilde{N}=\lfloor N/K \rfloor$ instructions per subset, avoiding reliance on labeled data for selection. Empirical results on Dolly and Cleaned Alpaca show that SelectLLM outperforms strong baselines, including Alpagasus, in Rouge-L and Cosine Similarity metrics, while also reducing selection cost (e.g., $2.82 vs. $23.76 for 3k samples). The approach demonstrates cross-dataset generalization and reveals insights into LLM compatibility with selection tasks, though it highlights costs and scalability considerations for widespread deployment.

Abstract

Instruction tuning benefits from large and diverse datasets; however, creating such datasets involves a high cost of human labeling. While synthetic datasets generated by large language models (LLMs) have partly solved this issue, they often contain low-quality data. One effective solution is selectively annotating unlabeled instructions, especially given the relative ease of acquiring unlabeled instructions or texts from various sources. However, how to select unlabeled instructions has not been well explored, especially in the context of LLMs. Therefore, we introduce SelectLLM, an alternative framework that leverages the capabilities of LLMs to select unlabeled instructions more effectively. Specifically, SelectLLM consists of two key steps: coreset-based clustering of unlabeled instructions to increase diversity, and prompting an LLM to identify the most beneficial instructions within each cluster. We evaluate SelectLLM on AlpacaEval2 and MT-Bench, demonstrating its ability to outperform state-of-the-art methods like Alpagasus. In addition, we compare the performance and compatibility of SelectLLM with various LLMs, such as ChatGPT, LLaMA-3.1-70B, and Gemma-2-27b. SelectLLM's adaptability and robustness are further evidenced by its ability to maintain high performance across both human and synthetic datasets. All code and data are publicly available (https://github.com/minnesotanlp/select-llm).
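
The two steps named in the abstract (diversity-oriented clustering, then per-cluster selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy capacity-constrained assignment is only one simple way to approximate equal-size K-means, the function names `equal_size_kmeans`, `select_instructions`, and `longest_first` are hypothetical, and `choose` is a stand-in for the LLM prompt call, which is not reproduced here. The per-cluster quota follows the floor(N/K) rule stated in the TL;DR.

```python
import math
import random


def equal_size_kmeans(points, k, iters=10, seed=0):
    """Assign points to k clusters, none larger than ceil(n / k).

    Assumption: a greedy nearest-first assignment under a capacity cap,
    used here as a simple approximation of equal-size K-means.
    """
    rng = random.Random(seed)
    n = len(points)
    cap = math.ceil(n / k)
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * n
    for _ in range(iters):
        # Every (distance, point index, cluster index) pair, nearest first.
        pairs = sorted(
            (math.dist(points[i], centers[c]), i, c)
            for i in range(n)
            for c in range(k)
        )
        sizes = [0] * k
        assigned = [False] * n
        for _, i, c in pairs:
            if not assigned[i] and sizes[c] < cap:
                labels[i] = c
                assigned[i] = True
                sizes[c] += 1
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [points[i] for i in range(n) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def select_instructions(instructions, embeddings, n_select, k, choose):
    """Pick floor(n_select / k) instructions from each of k clusters."""
    per_cluster = n_select // k
    labels = equal_size_kmeans(embeddings, k)
    selected = []
    for c in range(k):
        cluster = [instructions[i] for i, lab in enumerate(labels) if lab == c]
        # `choose` is a placeholder for the LLM-based selector.
        selected.extend(choose(cluster, per_cluster))
    return selected


def longest_first(cluster, m):
    """Hypothetical stand-in selector: keep the m longest instructions."""
    return sorted(cluster, key=len, reverse=True)[:m]
```

In practice `embeddings` would come from a sentence encoder and `choose` would format the cluster into a prompt and parse the indices the LLM returns; both are left abstract here because the source does not specify them in this section.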

Paper Structure

This paper contains 24 sections, 2 equations, 10 figures, 17 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Conceptual comparison between previous approaches to select instructions and SelectLLM. Focusing on input instructions (top) is unable to consider the difficulty or uncertainty of response. Output-based methods (middle) can suffer from the inference cost and quality issues of synthetic responses. SelectLLM (bottom) does not suffer from these issues by estimating the effectiveness of instructions via prompting LLMs.
  • Figure 2: Experiments to verify LLMs' capability to infer the importance of unlabeled instructions. We prompt ChatGPT to sort the instructions based on their effectiveness for model training; then, we compare the performance of three fine-tuned LMs (LLaMA-2) on instructions with different ranks (first, center, and last). The full prompt is presented in the appendix.
  • Figure 3: Illustration of the proposed SelectLLM.
  • Figure 4: Designed input prompt of SelectLLM.
  • Figure 5: Qualitative example of selection with a given query composed of 14 instructions on Dolly.
  • ...and 5 more figures