GRAD: Generative Retrieval-Aligned Demonstration Sampler for Efficient Few-Shot Reasoning

Oussama Gabouj, Kamel Charaf, Ivan Zakazov, Nicolas Baldwin, Robert West

TL;DR

GRAD introduces a reinforcement-learning–driven generator that creates input-specific, token-budgeted demonstrations to accompany prompts, outperforming static RAG under tight budgets and showing strong generalization to OOD domains. The GRAD pipeline includes a warm-start GRADi variant, a composite multi-objective reward, and a strict 300-token demonstration cap plus a 256-token final answer limit, balancing accuracy and brevity. Evaluations across GSM8K and diverse OOD benchmarks (MMLU*, MathQA*, draw-structured, ARC_challenge, DeepMind basic_math) show that GRAD, especially with larger backbones, consistently surpasses RAG and other baselines, while demonstrations from smaller models can effectively guide larger models. The work advocates a scalable, dynamic few-shot paradigm and discusses potential hybrids (H-GRAD) and limitations around factuality, fixed demonstration counts, and ethical considerations in generative prompts.

Abstract

Large Language Models (LLMs) achieve strong performance across diverse tasks, but their effectiveness often depends on the quality of the provided context. Retrieval-Augmented Generation (RAG) enriches prompts with external information, but its reliance on static databases constrains adaptability and can result in irrelevant demonstrations. In this work, we propose a Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach in which an LLM is trained to generate concise, input-specific demonstrations. By tailoring demonstrations to each input, our method offers better contextual support than traditional RAG approaches. We demonstrate the superiority of GRAD under budget constraints, where we limit both the number of tokens used per demonstration and the number of tokens used for the final output. Trained solely on a math dataset, GRAD consistently outperforms strong baselines on Qwen2.5-14B across mathematical reasoning and advanced STEM questions, highlighting GRAD's robust generalization to out-of-distribution (OOD) domains such as physics, chemistry, and computer science. Furthermore, we show that demonstrations generated by trained smaller models can effectively guide larger target models, reducing training costs while maintaining competitive accuracy. Overall, this work introduces a scalable demonstration generator model, presenting the first step toward a dynamic few-shot learning paradigm in resource-constrained settings. We release the code used for the project.

Paper Structure

This paper contains 64 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example input query from GSM8K with demonstrations and outputs from RAG and GRAD. RAG retrieves demonstrations from a static database, whereas GRAD generates task-specific demonstrations within a token budget. A final output length constraint is applied in both cases; GRAD produces shorter demonstrations and a more concise final answer.
  • Figure 2: Overview of the GRAD pipeline. Step 1: The model generates demonstrations, which are concatenated with the system prompt and the user's query to form the context. Step 2: The context guides the target model to produce a reasoning trace and the final answer. Step 3: The predicted answer is replaced with the correct answer and passed through the frozen LLM to compute the token-level log probabilities. Step 4: A multi-objective reward is computed to ensure confidence and correctness of the final answer and compliance with the token budget.
  • Figure 3: Heatmap of accuracy differences between GRADi and RAG. Red denotes gains for GRADi, blue denotes gains for RAG, and lighter cells indicate similar performance. Datasets on the x-axis are ordered from left to right by decreasing semantic similarity. Models on the y-axis are ordered by size from top to bottom. Each cell shows the mean percentage-point difference in exact-match accuracy for that (model, dataset) pair, with the colorbar indicating magnitude and sign.
  • Figure 4: Distribution of token lengths
  • Figure 5: Heatmap of accuracy differences between GRADi and different baselines. Red denotes gains for GRADi, blue denotes gains for the other baselines, and lighter cells indicate similar performance.
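
Steps 3 and 4 of the pipeline in Figure 2 can be illustrated with a minimal sketch of a composite reward. This is not the paper's reward: the functional form, the weights in `RewardWeights`, and the helper names `confidence_score`, `budget_penalty`, and `composite_reward` are all hypothetical and chosen only for illustration. The sketch assumes the token-level log probabilities of the correct answer have already been obtained from the frozen LLM, and it takes the 300-token demonstration cap and the 256-token answer limit from the TL;DR.

```python
import math
from dataclasses import dataclass
from typing import Sequence

DEMO_TOKEN_CAP = 300    # demonstration cap stated in the TL;DR
ANSWER_TOKEN_CAP = 256  # final-answer limit stated in the TL;DR


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not state the actual values.
    correctness: float = 1.0
    confidence: float = 0.5
    budget: float = 0.5


def confidence_score(gold_answer_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the correct-answer tokens.

    The log probabilities are assumed to come from the frozen LLM after the
    predicted answer has been replaced with the correct one (Step 3).
    """
    if not gold_answer_logprobs:
        return 0.0
    return math.exp(sum(gold_answer_logprobs) / len(gold_answer_logprobs))


def budget_penalty(n_tokens: int, cap: int) -> float:
    """Zero within the cap, otherwise the relative overshoot, clipped to 1."""
    return min(max(n_tokens - cap, 0) / cap, 1.0)


def composite_reward(
    predicted: str,
    gold: str,
    gold_answer_logprobs: Sequence[float],
    demo_tokens: int,
    answer_tokens: int,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Assumed linear mix of correctness, confidence, and budget compliance."""
    correct = 1.0 if predicted.strip() == gold.strip() else 0.0
    confidence = confidence_score(gold_answer_logprobs)
    penalty = (
        budget_penalty(demo_tokens, DEMO_TOKEN_CAP)
        + budget_penalty(answer_tokens, ANSWER_TOKEN_CAP)
    )
    return (
        weights.correctness * correct
        + weights.confidence * confidence
        - weights.budget * penalty
    )
```

Under these assumptions, a correct answer produced within both caps earns the full correctness and confidence terms, while a demonstration that overshoots the 300-token cap is penalized in proportion to the overshoot. A linear mix is only one plausible way to combine the three objectives named in the caption.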