A Teacher Is Worth A Million Instructions

Nikhil Kothari, Ravindra Nayak, Shreyas Shetty, Amey Patil, Nikesh Garera

TL;DR

The paper tackles the data-quality and generalization challenges of instruction tuning for relatively small LLMs with a two-stage training framework: knowledge distillation (KD) from large teacher models, followed by a post-training phase called Domain Alignment from Expert (DAE). The KD stage uses prediction-layer and attention-based losses to transfer knowledge from larger models to 7B- and 13B-scale students, while DAE injects domain-specific knowledge using a domain expert, with a reference model to preserve generalization. Empirical results on MT-Bench and AlpacaEval show that models trained with KD and DAE can surpass state-of-the-art models with more parameters, particularly on e-commerce tasks, while maintaining broad task generalization. The work argues that KD is a viable training paradigm for smaller models and that domain alignment can be achieved with limited domain data, offering a practical path to deploying capable domain-aware LLMs.

Abstract

Large Language Models (LLMs) have shown exceptional abilities, yet training these models can be quite challenging. There is a strong dependence on the quality of data and on finding the best instruction tuning set. Further, the inherent limitations in training methods make it substantially difficult to train relatively smaller models with 7B and 13B parameters. In our research, we suggest an improved training method for these models by utilising knowledge from larger models, such as mixture-of-experts (8x7B) architectures. The scale of these larger models allows them to capture a wide range of variations from data alone, making them effective teachers for smaller models. Moreover, we implement a novel post-training domain alignment phase that employs domain-specific expert models to boost domain-specific knowledge during training while preserving the model's ability to generalise. Fine-tuning Mistral 7B and 2x7B with our method surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters, achieving up to $7.9$ on MT-Bench and $93.04\%$ on AlpacaEval.
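
To make the idea of a prediction-layer distillation loss concrete, the following is a minimal sketch, not the authors' implementation. It assumes the loss is a per-token KL divergence between the teacher's and the student's next-token distributions, that both models share a vocabulary, and that an optional softmax temperature is used; none of these details are stated in the text above. The function names `softmax`, `kl_divergence`, and `prediction_layer_kd_loss` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def prediction_layer_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token KL(teacher || student) over a sequence.

    Assumption: each argument is a list of logit rows, one row per token
    position, and teacher and student use the same vocabulary ordering.
    """
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example: two token positions, vocabulary of three entries.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss = prediction_layer_kd_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is the property a distillation objective of this kind relies on.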

Paper Structure (15 sections, 6 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Model performance on MT-Bench. We compare Flip-2x7B-Instruct, trained with KD and DAE, to proprietary as well as larger, open-access models like Llama-2-70B-chat.
  • Figure 2: Self-Attention States: $a_{ij}$ represents the attention given to the $i$-th token when processing or generating the $j$-th token. Causal masking prevents attention to future tokens by setting their attention weight to zero.
  • Figure 3: DAE: domain samples refer to the domain expert as their teacher, while non-domain samples refer to the reference model. The stacked distributions are treated as the "true" distribution.
  • Figure 4: Panels 1(a) and 1(b) depict the progression of the attention loss and the prediction-layer loss, respectively. Training is conducted on attention-only loss (green), prediction-layer-only loss (blue), and both combined (orange). Although backpropagation occurs on only one or both losses, a natural correlation emerges between them, and gradient magnitudes align with expected trends. Panel 2(a) shows the Domain Alignment from Expert (DAE) training loss on Mistral 7B (green) and the 2x7B Mixture of Experts (blue). Notably, the loss curve for MoE 2x7B is significantly lower than that of Mistral 7B, as anticipated.
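
The captions for Figures 2 and 3 can be illustrated with a short sketch, offered under stated assumptions rather than as the paper's implementation. The first function builds causally masked attention weights using the caption's indexing, where $a_{ij}$ is the attention given to token $i$ when processing token $j$, so $a_{ij} = 0$ whenever $i > j$. The second function mirrors the DAE caption: each sample takes its target distribution from the domain expert if it is a domain sample and from the reference model otherwise. The names `causal_attention_weights` and `stack_teacher_targets`, the per-sample boolean flag, and the list-of-lists data layout are all hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Turn raw scores into causally masked attention weights.

    scores[i][j] is the raw score for token i when processing token j.
    Returns a where a[i][j] == 0.0 for i > j (future tokens) and the
    visible entries of each column j sum to 1.
    """
    n = len(scores)
    a = [[0.0] * n for _ in range(n)]
    for j in range(n):
        visible = [scores[i][j] for i in range(j + 1)]  # tokens 0..j only
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        for i in range(j + 1):
            a[i][j] = exps[i] / total
    return a


def stack_teacher_targets(is_domain_flags, expert_dists, reference_dists):
    """Pick one teacher distribution per sample (assumed DAE routing).

    Domain samples use the domain expert's distribution; all other
    samples use the reference model's distribution.
    """
    return [
        expert if is_domain else reference
        for is_domain, expert, reference in zip(
            is_domain_flags, expert_dists, reference_dists
        )
    ]


# Toy example: uniform scores over three tokens, and a two-sample batch.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
targets = stack_teacher_targets(
    [True, False],
    [[0.9, 0.1], [0.8, 0.2]],  # hypothetical domain-expert outputs
    [[0.5, 0.5], [0.6, 0.4]],  # hypothetical reference-model outputs
)
```

With uniform scores, the first token attends only to itself, and the last token spreads its attention evenly over all three positions, while every entry for a future token stays at zero.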