TinyThinker: Distilling Reasoning through Coarse-to-Fine Knowledge Internalization with Self-Reflection

Shengmin Piao, Sanghyun Park

TL;DR

TinyThinker addresses the risk of superficial imitation when distilling reasoning from large language models by introducing a coarse-to-fine knowledge internalization framework. It couples a three-stage reasoning process (recall, analyze, summarize) with a two-phase training regimen: reasoning acquisition, followed by self-reflection guided by iterative Direct Preference Optimization (DPO). Empirical results on CommonsenseQA (CSQA), OpenBookQA (OBQA), and StrategyQA show consistent gains, especially on OBQA and StrategyQA; ablations confirm the value of each component and the framework's scalability to larger student models. The approach offers a flexible, knowledge-centric way to give smaller models robust reasoning capabilities. It could be extended to other knowledge-intensive tasks, and data quality and generation efficiency remain areas for future improvement.

Abstract

Large Language Models exhibit impressive reasoning capabilities across diverse tasks, motivating efforts to distill these capabilities into smaller models through generated reasoning data. However, direct training on such synthesized reasoning data may lead to superficial imitation of the reasoning process, rather than fostering a genuine integration of reasoning capabilities with underlying knowledge. To address this, we propose TinyThinker, a framework introducing two novel approaches. First, we introduce a three-stage process that incrementally guides the student model through the reasoning process, progressively refining knowledge from coarse to fine granularity. Second, we develop a two-phase training framework comprising an initial reasoning acquisition phase followed by a self-reflection phase utilizing self-generated data. Experiments on commonsense reasoning benchmarks demonstrate that TinyThinker achieves superior performance compared to baselines. Ablation studies further validate the effectiveness of each component in our framework. We expect that TinyThinker can be extended to other knowledge-intensive reasoning tasks, offering an alternative strategy for developing effective reasoning capabilities in smaller language models. Code is available at https://github.com/shengminp/TinyThinker
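The three-stage process described in the abstract can be pictured as a chain of prompts in which each stage's output feeds the next. The following is only a minimal sketch under stated assumptions: the prompt wording in `STAGE_PROMPTS`, the function `run_three_stages`, and the `generate` callable are all hypothetical illustrations, not the templates or code from the TinyThinker repository.

```python
from typing import Callable, Dict

# Hypothetical stage prompts (assumption): the paper's actual templates
# are not given in this section.
STAGE_PROMPTS = {
    "recall": "Question: {question}\nRecall general knowledge relevant to the question:",
    "analyze": (
        "Question: {question}\nRecalled knowledge: {recall}\n"
        "Analyze the options using this knowledge:"
    ),
    "summarize": (
        "Question: {question}\nAnalysis: {analyze}\n"
        "Summarize the analysis and state the final answer:"
    ),
}


def run_three_stages(question: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Chain recall -> analyze -> summarize, feeding each output into the next prompt.

    `generate` stands in for the student model: it maps a prompt to text.
    """
    outputs = {"question": question}
    for stage in ("recall", "analyze", "summarize"):
        # str.format ignores unused keyword arguments, so passing all
        # accumulated outputs is safe.
        prompt = STAGE_PROMPTS[stage].format(**outputs)
        outputs[stage] = generate(prompt)
    return outputs


if __name__ == "__main__":
    # Toy generator that only reports how many lines the prompt had.
    result = run_three_stages(
        "Can a penguin fly?",
        lambda prompt: f"<model output for a {len(prompt.splitlines())}-line prompt>",
    )
    for stage, text in result.items():
        print(stage, "->", text)
```

The point of the chain is that the coarse output (recalled knowledge) becomes context for the finer stages, so a real student model would see progressively more specific input at each step.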

Paper Structure

This paper contains 39 sections, 4 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Comparison between TinyThinker and standard Chain-of-Thought Distillation. Top: Fine-tuning the student directly on teacher-generated reasoning data. Bottom: TinyThinker acquires reasoning capabilities through a three-stage process, further refined via self-reflection.
  • Figure 2: Detailed process of TinyThinker. Reasoning Acquisition: The student model follows a recall-analyze-summarize process, refining reasoning from coarse to fine granularity. Self-reflection: The model iteratively collects data and applies DPO. Pairwise data is first collected during the recall stage, and the preferred data from this stage informs the collection of pairwise data in the analyze stage. Once sufficient data is gathered, DPO is applied to refine the student's reasoning capabilities, facilitating progression to the next iteration of self-reflection.
  • Figure 3: Overall process of the training strategy. Top: During the reasoning acquisition phase, the recall-analyze-summarize process is repeated iteratively. Bottom: During the self-reflection phase, the recall-analyze process is iterated with DPO.
  • Figure 4: Accuracy (%) on CSQA and StrategyQA datasets across different model sizes.
  • Figure 5: Ablation study on the effects of each stage in the recall-analyze-summarize reasoning process.
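The data flow in the Figure 2 caption (pairs collected at the recall stage, with the preferred recall output then conditioning pair collection at the analyze stage) can be sketched as below. This is a hedged illustration, not the authors' implementation: the helper names, the `[recall]` and `[analyze]` prompt tags, the `n_samples` parameter, and above all the `is_correct` judging functions are assumptions, since this section does not say how preferred and rejected outputs are decided.

```python
from typing import Callable, Dict, List, Optional

Pair = Dict[str, str]


def collect_pair(
    prompt: str,
    sample: Callable[[str], str],
    is_correct: Callable[[str], bool],
    n_samples: int = 4,
) -> Optional[Pair]:
    """Sample candidates and return one preference record, or None.

    `is_correct` is a hypothetical judge (assumption): the preference
    criterion is not specified in this section.
    """
    candidates = [sample(prompt) for _ in range(n_samples)]
    good = [c for c in candidates if is_correct(c)]
    bad = [c for c in candidates if not is_correct(c)]
    if not good or not bad:
        return None  # no usable contrast for this prompt
    return {"prompt": prompt, "chosen": good[0], "rejected": bad[0]}


def collect_iteration(
    question: str,
    sample: Callable[[str], str],
    recall_ok: Callable[[str], bool],
    analyze_ok: Callable[[str], bool],
) -> List[Pair]:
    """Collect recall-stage pairs first, then analyze-stage pairs
    conditioned on the preferred recall output."""
    pairs: List[Pair] = []
    recall_pair = collect_pair(f"[recall] {question}", sample, recall_ok)
    if recall_pair is None:
        return pairs
    pairs.append(recall_pair)
    # The preferred recall output becomes context for the analyze stage.
    analyze_prompt = f"[analyze] {question}\n{recall_pair['chosen']}"
    analyze_pair = collect_pair(analyze_prompt, sample, analyze_ok)
    if analyze_pair is not None:
        pairs.append(analyze_pair)
    return pairs
```

Records with `prompt`, `chosen`, and `rejected` fields are a common layout for preference data. In a full loop, the accumulated pairs would be passed to a DPO training step before the next collection round; that step is omitted here.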