Meta-Sel: Efficient Demonstration Selection for In-Context Learning via Supervised Meta-Learning

Xubin Wang, Weijia Jia

TL;DR

The paper addresses the bottleneck of selecting demonstrations for in-context learning (ICL) under tight prompt budgets. It introduces Meta-Sel, a lightweight supervised meta-learning framework that scores (query, candidate) pairs with two inexpensive meta-features and ranks demonstrations with a calibrated logistic regressor. The regressor is trained offline and applied in a single scoring pass at inference, so selection requires no additional LLM calls. Meta-Sel is evaluated on four intent datasets with five open-source LLMs, where it ranks at or near the top, shows its largest gains on smaller models, and produces deterministic, auditable rankings. The work also benchmarks 12 selection methods, clarifying where simple similarity signals suffice and where learned weighting helps, and offers practical guidance for efficient ICL deployment as well as future extensions to richer meta-features and generation tasks.

Abstract

Demonstration selection is a practical bottleneck in in-context learning (ICL): under a tight prompt budget, accuracy can change substantially depending on which few-shot examples are included, yet selection must remain cheap enough to run per query over large candidate pools. We propose Meta-Sel, a lightweight supervised meta-learning approach for intent classification that learns a fast, interpretable scoring function for (candidate, query) pairs from labeled training data. Meta-Sel constructs a meta-dataset by sampling pairs from the training split and using class agreement as supervision, then trains a calibrated logistic regressor on two inexpensive meta-features: TF-IDF cosine similarity and a length-compatibility ratio. At inference time, the selector performs a single vectorized scoring pass over the full candidate pool and returns the top-k demonstrations, requiring no model fine-tuning, no online exploration, and no additional LLM calls. This yields deterministic rankings and makes the selection mechanism straightforward to audit via interpretable feature weights. Beyond proposing Meta-Sel, we provide a broad empirical study of demonstration selection, benchmarking 12 methods (spanning prompt engineering baselines, heuristic selection, reinforcement learning, and influence-based approaches) across four intent datasets and five open-source LLMs. Across this benchmark, Meta-Sel consistently ranks among the top-performing methods, is particularly effective for smaller models where selection quality can partially compensate for limited model capacity, and maintains competitive selection-time overhead.
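The pipeline described in the abstract (sample pairs, label them by class agreement, fit a logistic scorer on two meta-features, then rank the pool) can be illustrated with a minimal, standard-library-only sketch. Several details here are assumptions rather than the authors' implementation: the whitespace tokenizer and smoothed IDF, the length ratio defined as min/max token count, the plain gradient-descent logistic regression (with no separate calibration step), the toy dataset, and every function name.

```python
import math
import random
from collections import Counter

def fit_idf(texts):
    """Smoothed inverse document frequencies over whitespace tokens."""
    n = len(texts)
    df = Counter(tok for t in texts for tok in set(t.lower().split()))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(text, idf):
    """L2-normalised sparse TF-IDF vector; out-of-vocabulary tokens are dropped."""
    tf = Counter(tok for tok in text.lower().split() if tok in idf)
    vec = {tok: f * idf[tok] for tok, f in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def features(query, cand, idf):
    """Two meta-features per pair: TF-IDF cosine and a length ratio."""
    q, c = tfidf(query, idf), tfidf(cand, idf)
    cosine = sum(w * c.get(tok, 0.0) for tok, w in q.items())
    lq, lc = len(query.split()), len(cand.split())
    length_ratio = min(lq, lc) / max(lq, lc, 1)  # assumed definition
    return [cosine, length_ratio]

def score(x, w, b):
    """Predicted probability that query and candidate share a label."""
    return 1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, x)))))

def build_meta_dataset(train, idf, n_pairs=300, seed=0):
    """Sample (query, candidate) pairs; meta-label is 1 when the classes agree."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_pairs):
        (xq, yq), (xc, yc) = rng.sample(train, 2)
        X.append(features(xq, xc, idf))
        y.append(1.0 if yq == yc else 0.0)
    return X, y

def train_logreg(X, y, lr=0.5, epochs=300):
    """Full-batch gradient descent on the logistic loss (uncalibrated)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = score(xi, w, b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def select_top_k(query, pool, idf, w, b, k):
    """Score every candidate once, then return the k best as (score, text, label)."""
    scored = [(score(features(query, text, idf), w, b), i)
              for i, (text, _) in enumerate(pool)]
    scored.sort(key=lambda si: (-si[0], si[1]))  # ties broken by pool index
    return [(s, *pool[i]) for s, i in scored[:k]]

# Hypothetical toy training split (also used as the candidate pool).
train = [
    ("show account balance", "balance"),
    ("check savings account balance now", "balance"),
    ("current balance please", "balance"),
    ("block stolen card", "lost_card"),
    ("card lost block it immediately", "lost_card"),
    ("freeze lost card", "lost_card"),
    ("transfer funds today", "transfer"),
    ("send money transfer abroad quickly", "transfer"),
    ("wire transfer money", "transfer"),
]

idf = fit_idf([text for text, _ in train])
X, y = build_meta_dataset(train, idf)
w, b = train_logreg(X, y)
picked = select_top_k("block my lost card", train, idf, w, b, k=2)
for s, text, label in picked:
    print(f"{s:.3f}  {label:10s}  {text}")
```

Because selection is a fixed scoring function followed by a sort with an explicit tie-break, repeated calls return the same ranking, and the learned weights `w` can be inspected directly to see how much each meta-feature contributes.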

Paper Structure

This paper contains 35 sections, 1 theorem, 7 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Proposition 3.1

Fix a query $x_q$ and a size-$k$ selector that returns a subset $S$. Assume (i) the events $\{y_c=y_q\}$ are conditionally independent across $x_c\in S$ given $x_q$, and (ii) the prompted LLM predicts the correct label whenever $\exists x_c\in S$ such that $y_c=y_q$. Then the success probability satisfies $\Pr(\text{success}\mid x_q,S)\ge 1-\prod_{x_c\in S}\bigl(1-p_c(x_q)\bigr)$, where $p_c(x_q)$ is the probability that $y_c=y_q$ given $x_q$, and the optimal size-$k$ set is obtained by selecting the $k$ candidates with the largest $p_c(x_q)$.
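Under assumption (i), the probability that at least one selected candidate shares the query's label is $1-\prod_{x_c\in S}(1-p_c(x_q))$, reading $p_c(x_q)$ as the per-candidate match probability (an interpretation assumed here). Since each factor $1-p_c(x_q)$ shrinks as $p_c(x_q)$ grows, picking the $k$ largest probabilities should maximize this quantity. The sketch below checks that claim by brute force on a small set of made-up probabilities; the function names and numbers are hypothetical.

```python
from itertools import combinations
from math import prod

def at_least_one_match(probs):
    """1 - prod(1 - p): chance that some selected candidate matches the query label."""
    return 1.0 - prod(1.0 - p for p in probs)

def best_subset_bruteforce(p, k):
    """Enumerate every size-k index set and keep the one with the highest value."""
    return max(combinations(range(len(p)), k),
               key=lambda S: at_least_one_match(p[i] for i in S))

def top_k(p, k):
    """Indices of the k largest probabilities, returned in ascending index order."""
    return tuple(sorted(sorted(range(len(p)), key=lambda i: -p[i])[:k]))

# Hypothetical match probabilities for five candidates.
p = [0.10, 0.65, 0.30, 0.80, 0.45]
k = 2

brute = best_subset_bruteforce(p, k)
greedy = top_k(p, k)
print(brute, greedy, round(at_least_one_match(p[i] for i in greedy), 4))
```

The exhaustive search costs time exponential in $k$, whereas the top-$k$ rule needs only one sort, which is what makes a single scoring pass sufficient under this abstraction.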

Figures (5)

  • Figure 1: Motivation of Meta-Sel. Given a user query ($x$) and a candidate pool ($\mathcal{D}_{\text{train}}$), standard similarity-based selectors (semantically similar) often retrieve examples that share spurious correlations or incorrect labels (visualized as red documents), leading to noisy prompts and incorrect LLM predictions (Top Path). In contrast, Meta-Sel (Bottom Path) leverages a learned scoring function to approximate $P(y|x, c)$, effectively filtering out noise and selecting helpful demonstrations that act as robust reasoning anchors, ensuring correct model outputs.
  • Figure 2: Meta-Sel Framework Overview. (1) Offline Meta-Training (top): We sample (query, candidate) pairs from labeled training data and assign meta-labels based on class agreement ($\ell = \mathbbm{1}[y_q{=}y_c]$). Two lightweight meta-features (TF-IDF cosine similarity and length ratio) are extracted per pair, and a logistic regression classifier $h_\theta$ is trained to predict label match. (2) Test-Time Selection (bottom): For a new query $x_q$, we score every candidate in the pool using the learned $h_\theta$ in a single vectorized pass, rank by predicted match probability, and return the top-$k$ demonstrations. The entire selection is deterministic, interpretable, and requires no additional LLM calls.
  • Figure 3: Meta-Sel Ablation Studies and Parameter Sensitivity. All ablation accuracy experiments use Banking with Qwen3-8B ($n{=}500$, 3 seeds). (a) Left: performance improves with more demonstrations, saturating around $k{=}10$. Right: meta-learning provides modest gains over pure similarity ranking. (b) Left: heatmap of mean LR coefficients showing similarity dominance across all four datasets. Right: bar chart with standard deviation across seeds, confirming coefficients range from 6.5 (Liu54) to 16.9 (CLINC) for similarity, while length ratio and intercept stay near zero.
  • Figure 4: Model Scale Effect on Method Performance. Accuracy of five representative methods across five LLMs on each dataset. The shaded area highlights the gap between Meta-Sel and Random selection, illustrating that learned selection provides the largest gains on smaller models.
  • Figure 5: Method Ranking Heatmap Across All Settings. Each cell shows the rank (1=best, 12=worst) of each method for a specific dataset-model combination. The rightmost column shows the mean rank across all 20 settings. Meta-Sel achieves the best mean rank, followed by RDES and Influence Functions.

Theorems & Definitions (1)

  • Proposition 3.1: Top-$k$ optimality under a "one-match suffices" abstraction