Improving In-Context Learning with Reasoning Distillation

Nafis Sadeq, Xin Xu, Zhouhang Xie, Julian McAuley, Byungkyu Kang, Prarit Lamba, Xiang Gao

TL;DR

This work tackles the challenge of inductive reasoning in in-context learning by proposing ReDis, a black-box reasoning distillation pipeline. A teacher model generates candidate hypotheses, their fitness is estimated through noisy rule-following, and supervised fine-tuning (SFT) followed by preference alignment produces a student capable of efficient hypothesis search. The method achieves substantial performance gains across four diverse inductive-reasoning tasks (List Function, 1D-ARC, ACRE, MiniSCAN) on multiple backbones, and matches or surpasses GPT-4o in some settings while reducing inference-time token costs. A key component is alignment with ORPO, which optimizes a combined loss $\mathcal{L}=\mathcal{L}_{\mathrm{sft}}+\lambda\mathcal{L}_{\mathrm{or}}$ to favor high-quality rule generation within a smaller search space. The results indicate that distilling inductive reasoning rules, rather than only input-output pairs, yields better generalization to novel inputs and improves efficiency, suggesting practical benefits for deploying open-weight models on complex reasoning tasks. The work also highlights the importance of evaluating hypothesis quality by how well a rule reproduces the demonstrations, and offers a framework for data augmentation, SFT, and alignment that could extend to other reasoning-intensive domains.
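The combined loss above can be illustrated with a minimal Python sketch. It assumes the standard odds-ratio formulation of ORPO, where $\mathcal{L}_{\mathrm{or}} = -\log\sigma\big(\log\tfrac{\mathrm{odds}(y_w)}{\mathrm{odds}(y_l)}\big)$ and $\mathrm{odds}(y) = p(y)/(1-p(y))$, with $p$ taken from the length-averaged log-probability of a response. The function names and the default value of `lam` are hypothetical and are not taken from the paper or its code release.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), equal to log(1 + exp(-x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """Combined loss L = L_sft + lam * L_or for one (chosen, rejected) rule pair.

    logp_chosen / logp_rejected are length-averaged log-probabilities that the
    model assigns to the chosen and rejected rule (assumed inputs).
    lam is an illustrative weight, not a value reported by the authors.
    """
    l_sft = -logp_chosen  # negative log-likelihood of the chosen rule
    l_or = neg_log_sigmoid(log_odds(logp_chosen) - log_odds(logp_rejected))
    return l_sft + lam * l_or


# Usage: the odds-ratio term shrinks as the chosen rule becomes relatively more likely.
loss_separated = orpo_loss(math.log(0.8), math.log(0.2))
loss_tied = orpo_loss(math.log(0.8), math.log(0.8))
```

In this sketch the SFT term depends only on the chosen rule, so the two usage calls differ only in the odds-ratio penalty, which is what pushes the model away from the rejected rule.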

Abstract

Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially enhance the in-context learning performance of language models, but they often fail to improve the model's understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines across all tasks and even surpasses the teacher model, GPT-4o, in some cases. ReDis, based on the LLaMA-3 backbone, achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively, within a similar hypothesis search space. The code, dataset, and model checkpoints will be made available at https://github.com/NafisSadeq/reasoning-distillation.git.

Paper Structure

This paper contains 33 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of data augmentation, filtering and model tuning in ReDis. The teacher and student models are shown in purple and black, respectively. Each entry in the rule generation corpus contains multiple few-shot demonstrations. Each entry in the SFT and preference alignment corpora includes both the rule generation and the corresponding rule-following instructions. These details are omitted for clarity.
  • Figure 2: The impact of hypothesis size on the performance improvement of ReDis-Llama. Results are shown for hypothesis sizes of $1$, $3$, $5$, $7$, and $10$.
  • Figure 3: The impact of decoding temperature on the performance of (a) ReDis-Mistral and (b) ReDis-Llama on the List Function task. We perform a grid search over temperatures between $0.6$ and $1.0$ for both rule generation and rule following, and find the best performance at a rule generation temperature of $0.9$ and a rule following temperature of $0.7$.
  • Figure 4: Score distribution of chosen and rejected rules in the augmented dataset via noisy fitness estimation. $n$ denotes the maximum number of few-shot demonstrations, and $d$ denotes the minimum score difference between a (chosen, rejected) rule pair.
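
The roles of $n$ and $d$ in Figure 4 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a rule's fitness is the fraction of (at most $n$) demonstrations whose output is reproduced when a model follows the rule, and that a (chosen, rejected) pair is kept only when the scores differ by at least $d$. The `follow_rule` callable stands in for the noisy rule-following model, and all names and the toy rules are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]  # (input, expected output)


def fitness(rule: str, demos: Sequence[Demo],
            follow_rule: Callable[[str, str], str]) -> float:
    """Fraction of demonstrations whose output the rule reproduces."""
    hits = sum(1 for x, y in demos if follow_rule(rule, x) == y)
    return hits / len(demos)


def preference_pairs(rules: Sequence[str], demos: Sequence[Demo],
                     follow_rule: Callable[[str, str], str],
                     n: int, d: float) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) rule pairs whose fitness gap is at least d."""
    demos = demos[:n]  # use at most n few-shot demonstrations
    scored = [(r, fitness(r, demos, follow_rule)) for r in rules]
    pairs = []
    for (ra, sa), (rb, sb) in combinations(scored, 2):
        gap = abs(sa - sb)
        if gap >= d and gap > 0:  # never pair rules with identical scores
            pairs.append((ra, rb) if sa > sb else (rb, ra))
    return pairs


# Toy stand-in for the rule-following model (hypothetical, deterministic).
def toy_follow(rule: str, x: str) -> str:
    return {"reverse": x[::-1], "upper": x.upper()}.get(rule, x)


toy_demos = [("ab", "ba"), ("cd", "dc")]
toy_pairs = preference_pairs(["identity", "reverse", "upper"],
                             toy_demos, toy_follow, n=2, d=0.5)
```

A larger $d$ keeps only pairs with a clear quality gap, which gives a cleaner preference signal at the cost of fewer training pairs; because the rule follower is itself a model, the scores are noisy estimates rather than exact labels.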