TPD: Enhancing Student Language Model Reasoning via Principle Discovery and Guidance

Haorui Wang, Rongzhi Zhang, Yinghao Li, Lingkai Kong, Yuchen Zhuang, Xiusi Chen, Chao Zhang

TL;DR

TPD is a teacher–student framework in which a stronger LLM analyzes a weaker LLM's errors and distills them into high-level corrective principles. It runs in two stages: principle generation, where the teacher produces a problem-solving instruction and a principle list, and principle exploitation, where those principles are used to refine the instruction and select informative in-context examples. All teacher involvement happens offline, so the student needs no teacher assistance at inference. Across eight symbolic and arithmetic reasoning tasks, TPD improves on standard chain-of-thought (CoT) prompting, with the best results when error-derived principles are used to select examples. The work highlights the practicality of offline, principle-based knowledge transfer and points to extensions toward longer, more scalable principle lists and more complex reasoning tasks.

Abstract

Large Language Models (LLMs) have recently showcased remarkable reasoning abilities. However, larger models often surpass their smaller counterparts in reasoning tasks, posing the challenge of effectively transferring these capabilities from larger models. Existing approaches rely heavily on extensive fine-tuning data or continuous interactions with a superior teacher LLM during inference. We introduce a principle-based teacher-student framework called "Teaching via Principle Discovery" (TPD) to address these limitations. Inspired by human learning mechanisms, TPD mimics the interaction between a teacher and a student using a principle-based approach. The teacher LLM generates problem-solving instructions and corrective principles based on the student LLM's errors. These principles guide the refinement of instructions and the selection of instructive examples from a validation set. This enables the student model to learn from both the teacher's guidance and its own mistakes. Once the student model begins making inferences, TPD requires no further intervention from the teacher LLM or humans. Through extensive experiments across eight reasoning tasks, we demonstrate the effectiveness of TPD. Compared to standard chain-of-thought prompting, TPD significantly improves the student model's performance, achieving a $6.2\%$ improvement on average.

Paper Structure

This paper contains 34 sections, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Illustration of TPD. To generate corrective principles, the teacher model first generates problem-solving instructions and then summarizes principles based on errors made by the student model on validation questions. During principle exploitation, the problem-solving instruction and examples that illustrate the principles are combined into the prompt to guide student learning.
  • Figure 2: TPD pipeline. It contains two stages: principle generation and principle exploitation. In principle generation, the student model generates answers according to the problem-solving instructions from the teacher model. Then, the teacher provides a list of principles based on the student's practice errors. In principle exploitation, the teacher model refines the instruction and chooses representative examples, which are used by the student for inference.
  • Figure 3: Test accuracy for different numbers of examples in the problem-solving instruction in (a) symbolic reasoning tasks and (b) arithmetic reasoning tasks.
  • Figure 4: An ablation study on the number of examples selected based on the principle list. Zero examples means the prompt contains only the modified problem-solving instruction.
  • Figure 5: Ablation studies on (a) whether the teacher model needs to provide problem-solving methods in the problem-solving instruction and (b) how to utilize selected examples with the modified problem-solving instruction.
  • ...and 1 more figure
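
The two stages described in the abstract and in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_tpd_prompt`, the five callables standing in for teacher and student LLM calls, the exact-match error check, and the `Q:`/`A:` prompt layout are all hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (question, reference answer)
ErrorCase = Tuple[str, str, str]     # (question, student answer, reference answer)


def build_tpd_prompt(
    draft_instruction: Callable[[], str],
    answer: Callable[[str, str], str],
    summarize_principles: Callable[[List[ErrorCase]], List[str]],
    refine_instruction: Callable[[str, List[str]], str],
    select_examples: Callable[[List[str], List[Example]], List[Example]],
    validation: List[Example],
) -> str:
    """Return the offline-built prompt the student would reuse at inference.

    All callables are hypothetical stand-ins for LLM calls:
    draft_instruction, summarize_principles, refine_instruction and
    select_examples play the teacher; answer plays the student.
    """
    # Stage 1: principle generation.
    # The teacher writes a problem-solving instruction, the student practices
    # on validation questions, and the teacher summarizes the student's errors.
    instruction = draft_instruction()
    errors: List[ErrorCase] = []
    for question, reference in validation:
        prediction = answer(instruction, question)
        if prediction != reference:  # assumption: exact-match correctness check
            errors.append((question, prediction, reference))
    principles = summarize_principles(errors)

    # Stage 2: principle exploitation.
    # The teacher refines the instruction and picks examples that illustrate
    # the principles; both are combined into a single prompt.
    refined = refine_instruction(instruction, principles)
    examples = select_examples(principles, validation)
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{refined}\n\n{shots}" if shots else refined
```

Because the returned prompt is a plain string, it can be built once and reused for every test question, which reflects the claim that no teacher calls are needed once the student begins inference.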