FastGAS: Fast Graph-based Annotation Selection for In-Context Learning

Zihan Chen, Song Wang, Cong Shen, Jundong Li

TL;DR

FastGAS proposes a graph-based, unsupervised method to efficiently select diverse and representative unlabeled instances for in-context learning prompts. It constructs a data similarity graph from embeddings, partitions it with a multi-level graph bisection into K components, and greedily selects M/K high-degree nodes per component, achieving strong ICL prompts with dramatically reduced computation time. The approach outperforms state-of-the-art baselines across seven datasets and multiple language models, with robust performance under different prompt retrieval strategies. The method is theoretically grounded via a greedy selection guarantee and demonstrates practical applicability across LLMs of varying sizes, making prompt construction for ICL both faster and more reliable.

Abstract

In-context learning (ICL) empowers large language models (LLMs) to tackle new tasks by using a series of training instances as prompts. Since generating prompts requires sampling from a vast pool of instances and annotating them (e.g., adding labels in a classification task), existing methods select a subset of unlabeled examples for annotation, which improves prompt quality while reducing annotation costs. However, these methods often take a long time to select instances because of their complexity, which hinders their practical viability. To address this limitation, we propose a graph-based selection method, FastGAS, designed to efficiently identify high-quality instances while minimizing computational overhead. First, we construct a data similarity graph based on instance similarities. Next, we use a graph partitioning algorithm to split the graph into pieces. Within each piece (i.e., subgraph), we adopt a greedy approach to pick the most representative nodes. By aggregating nodes from diverse pieces and annotating the corresponding instances, we identify a set of diverse and representative instances for ICL. Compared to prior approaches, our method not only achieves superior performance on different tasks but also significantly reduces selection time. In addition, we demonstrate the efficacy of our approach with larger LLMs.
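The three steps in the abstract (similarity graph, partitioning, per-piece greedy selection) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the cosine k-nearest-neighbour graph, and the tie-breaking rule are all hypothetical, and a simple spectral (Fiedler vector) median split stands in for the multi-level graph bisection named in the TL;DR. The sketch also assumes the number of pieces is a power of two and divides the budget evenly.

```python
import numpy as np

def knn_graph(emb, k):
    """Symmetric k-nearest-neighbour graph under cosine similarity (assumed)."""
    emb = np.asarray(emb, dtype=float)
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)            # exclude self-loops
    nbrs = np.argsort(-sim, axis=1)[:, :k]    # k most similar nodes per row
    adj = {i: set() for i in range(len(emb))}
    for i, row in enumerate(nbrs):
        for j in row:
            adj[i].add(int(j))
            adj[int(j)].add(i)                # make the graph undirected
    return adj

def bisect(nodes, adj):
    """Stand-in for multi-level bisection: balanced split on the Fiedler vector."""
    assert len(nodes) >= 2
    idx = {v: t for t, v in enumerate(nodes)}
    lap = np.zeros((len(nodes), len(nodes)))  # Laplacian of the induced subgraph
    for v in nodes:
        for u in adj[v]:
            if u in idx:
                lap[idx[v], idx[u]] = -1.0
                lap[idx[v], idx[v]] += 1.0
    _, vecs = np.linalg.eigh(lap)             # eigenvalues in ascending order
    order = np.argsort(vecs[:, 1], kind="stable")
    half = len(nodes) // 2
    return [nodes[t] for t in order[:half]], [nodes[t] for t in order[half:]]

def partition(adj, num_parts):
    """Recursive bisection; num_parts is assumed to be a power of two."""
    parts = [sorted(adj)]
    while len(parts) < num_parts:
        parts = [half for p in parts for half in bisect(p, adj)]
    return parts

def greedy_pick(nodes, adj, budget):
    """Repeatedly take the node with the most not-yet-covered edges in its piece."""
    inside = set(nodes)
    residual = {v: adj[v] & inside for v in nodes}   # copies; adj is not mutated
    chosen = []
    for _ in range(min(budget, len(nodes))):
        best = max((v for v in nodes if v not in chosen),
                   key=lambda v: (len(residual[v]), -v))
        chosen.append(best)
        for u in residual[best]:
            residual[u].discard(best)         # edges at `best` are now covered
        residual[best] = set()
    return chosen

def select_annotations(emb, k, num_parts, budget):
    """Return indices of instances to annotate: budget // num_parts per piece."""
    adj = knn_graph(emb, k)
    picked = []
    for part in partition(adj, num_parts):
        picked.extend(greedy_pick(part, adj, budget // num_parts))
    return sorted(picked)
```

For example, `select_annotations(emb, k=5, num_parts=4, budget=8)` on a 64-row embedding matrix returns eight distinct row indices, two from each of four equally sized pieces. Spreading the budget across pieces is what gives diversity; the degree-based pick inside each piece is what gives representativeness.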

Paper Structure

This paper contains 32 sections, 1 theorem, 11 equations, 6 figures, 7 tables.

Key Result

Proposition 3.1

Given the budget $n$ and graph $\mathcal{G}$, the greedy algorithm selects a set $\mathcal{V}^{sel} = \{v^1, \ldots, v^n\}$ that maximizes the total number of edges within $\mathcal{V}^{sel}$ and edges connecting $\mathcal{V}^{sel}$ to $\mathcal{G}\setminus \mathcal{V}^{sel}$.
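
One way to read the quantity in Proposition 3.1 is as the number of edges with at least one endpoint in the selected set. Writing $\mathcal{E}$ for the edge set of $\mathcal{G}$ and $f$ for the count (both symbols are introduced here for illustration and are not taken from the paper), this reading is:

$$
f(\mathcal{V}^{sel}) = \left|\{(u, v) \in \mathcal{E} : u \in \mathcal{V}^{sel} \text{ or } v \in \mathcal{V}^{sel}\}\right|
$$

As a small worked example under this reading, take five nodes with edges $(1,2)$, $(1,3)$, $(1,4)$, $(4,5)$. With $n = 1$, choosing node $1$ gives $f = 3$, while choosing node $4$ gives $f = 2$. With $n = 2$, a greedy pass that first takes node $1$ (the highest degree) and then node $4$ (incident to the only remaining edge) reaches $f(\{1, 4\}) = 4$, which covers every edge in the graph.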

Figures (6)

  • Figure 1: Comparison of our method and two baselines on three classification tasks (MRPC, SST-5, and DBpedia) with respect to time consumption during subset selection. The annotation budget is $18$. The y-axis shows time consumption on a log scale. Notably, our method significantly reduces the time cost compared with both baseline methods.
  • Figure 2: An overview of FastGAS. Given the unlabeled data pool, we first construct a graph based on data similarity. This graph is then partitioned into distinct components. Within each component, we employ a greedy algorithm to select nodes until we reach the annotation budget. The selected instances are annotated and subsequently used to retrieve ICL prompts for the task.
  • Figure 3: Comparison of our method and other graph-based baselines with respect to different $k$ for the construction of the graph.
  • Figure 4: Comparison of our method and two baselines on three classification tasks (MRPC, SST-5, and DBpedia) with respect to time consumption during subset selection. The annotation budget is $100$. The y-axis shows time consumption on a log scale.
  • Figure 5: Performance of FastGAS across different numbers of partitions with an annotation budget of $100$.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Proposition 3.1
  • Remark 1
  • Proof