TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise

Nan He; Hanyu Lai; Chenyang Zhao; Zirui Cheng; Junting Pan; Ruoyu Qin; Ruofan Lu; Rui Lu; Yunchen Zhang; Gangming Zhao; Zhaohui Hou; Zhiyuan Huang; Shaoqing Lu; Ding Liang; Mingjie Zhan

TL;DR

The paper tackles the challenge that small language models struggle with reasoning by introducing TeacherLM, a family of open-source, explanation-rich teachers that generate fundamentals, chain-of-thought, and common-mistake annotations for each sample. It trains BLOOM-based models via a multi-stage, multi-task pipeline on large-scale datasets (P3-Sense-3K, Muffin-3W, TeacherData-2M) to produce specialized teachers (Fundamental, CoT, CommonMistake) and demonstrates their effectiveness in augmenting 58 NLP datasets and improving student models, achieving a zero-shot MMLU score of 52.3 with 7.1B parameters and 59.8 with 176B. The work provides detailed evaluation protocols and shows that data-augmented reasoning can rival or exceed larger models in zero-shot settings, while remaining cost-effective and open-source. Overall, TeacherLM offers a scalable, accessible path to endow smaller models with robust reasoning capabilities through structured, sample-level explanations.

Abstract

Large Language Models (LLMs) exhibit impressive reasoning and data augmentation capabilities in various NLP tasks. However, what about small models? In this work, we propose TeacherLM-7.1B, capable of annotating relevant fundamentals, chain of thought, and common mistakes for most NLP samples, which makes annotation more than just an answer, thus allowing other models to learn "why" instead of just "what". The TeacherLM-7.1B model achieved a zero-shot score of 52.3 on MMLU, surpassing most models with over 100B parameters. Even more remarkable is its data augmentation ability. Based on TeacherLM-7.1B, we augmented 58 NLP datasets and taught various student models with different parameters from OPT and BLOOM series in a multi-task setting. The experimental results indicate that the data augmentation provided by TeacherLM has brought significant benefits. We will release the TeacherLM series of models and augmented datasets as open-source.
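The three annotation types named in the abstract (fundamentals, chain of thought, common mistakes) can be illustrated with a minimal sketch of how one sample might be augmented. Everything here is an assumption for illustration: the prompt suffixes, the `Sample` type, and the `generate` callable standing in for a teacher model are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical prompt suffixes, one per annotation type.
# The paper's actual prompt wording is not given in this section.
PROMPT_SUFFIXES: Dict[str, str] = {
    "fundamentals": "Fundamentals:",
    "chain_of_thought": "Let's think step by step:",
    "common_mistakes": "Common mistakes:",
}


@dataclass
class Sample:
    """A question-answer pair from some NLP dataset (illustrative type)."""
    question: str
    answer: str


def build_prompts(sample: Sample) -> Dict[str, str]:
    """Build one prompt per annotation type for a single sample."""
    base = f"Question: {sample.question}\nAnswer: {sample.answer}\n"
    return {kind: base + suffix for kind, suffix in PROMPT_SUFFIXES.items()}


def augment(sample: Sample, generate: Callable[[str], str]) -> str:
    """Return the sample as text with the three explanations added.

    `generate` is any function mapping a prompt to generated text;
    here it is a stand-in for a teacher model, not a real API.
    """
    prompts = build_prompts(sample)
    lines = [f"Question: {sample.question}"]
    for kind, suffix in PROMPT_SUFFIXES.items():
        explanation = generate(prompts[kind]).strip()
        lines.append(f"{suffix} {explanation}")
    lines.append(f"Answer: {sample.answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stub generator so the sketch runs without any model.
    def stub(prompt: str) -> str:
        return "<teacher output>"

    print(augment(Sample("What is 2 + 2?", "4"), stub))
```

In this layout the explanations sit between the question and the answer, so a student model trained on the augmented text sees the reasoning before the label. That ordering is a design choice of this sketch, not a claim about the paper's data format.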

Paper Structure (25 sections, 1 equation, 5 figures, 7 tables)

Introduction
Related Work
Training TeacherLM
Dataset Construction
Training Procedure
Models
Multi-stage training
Evaluation Protocol
Benchmark
Evaluation Method
Results
Zero-shot Setting
Multi-stage Training Helps A Lot
Add Subject To Prompt
Teaching Student
...and 10 more sections

Figures (5)

Figure 1: TeacherLMs can perform augmentation on a wide range of datasets. They can leverage three different prompts to generate augmentations (fundamentals, CoT, and common mistakes), providing the complete information needed to solve the problem. Multi-task training on augmented versus unaugmented P3-Sense-3K data shows that the augmented data improves the zero-shot performance of the BLOOM-7.1B student model on 47 of the 57 tasks in the MMLU benchmark.
Figure 2: Average MMLU scores (%) for 57 tasks with model and human accuracy comparisons. TeacherLMs are in the 0-shot setting, and the rest are in the 5-shot setting.
Figure 3: Performance of different student models trained on data augmented by TeacherLM-7.1B. We augmented P3-Sense-3K (58 NLP datasets) and trained student models using the multi-task method. The x-axis indicates the number of training tokens, while the y-axis indicates zero-shot scores on various datasets. The gray line labeled "Plain" represents the student models' accuracy without training. The orange line represents how accuracy changes when student models are trained directly on the original P3-Sense-3K, and the blue line represents the effect of training on the augmented P3-Sense-3K dataset.
Figure 4: Example showing a common sense question, with its manual explanation from the original dataset and the CoT, fundamentals, and common mistakes generated by text-davinci-003 and TeacherLM-7.1B, from top to bottom.
Figure 5: More examples, each with its manual explanation from the original dataset and the CoT, fundamentals, and common mistakes generated by text-davinci-003 and TeacherLM-7.1B, from top to bottom.
