A Teacher Is Worth A Million Instructions
Nikhil Kothari, Ravindra Nayak, Shreyas Shetty, Amey Patil, Nikesh Garera
TL;DR
The paper tackles the data-quality and generalization challenges of instruction tuning for relatively small LLMs by presenting a two-stage training framework that combines knowledge distillation (KD) from large teacher models with a post-training phase called Domain Alignment from Expert (DAE). The KD component uses prediction-layer and attention-based losses to transfer knowledge from larger models to 7B- and 13B-scale students, while DAE injects domain-specific knowledge using a domain expert and a reference model to preserve generalization. Empirical results on MT-Bench and AlpacaEval show that the proposed KD and DAE approaches can surpass state-of-the-art models with more parameters, particularly on e-commerce tasks, while maintaining broad task generalization. The work argues that KD is a viable training paradigm for smaller models and that domain-aligned training can be achieved effectively with limited domain data, offering a practical path to deploying capable domain-aware LLMs. Overall, the combination of KD and DAE expands the toolkit for scalable, domain-aware instruction-tuned LLMs.
Abstract
Large Language Models (LLMs) have shown exceptional abilities, yet training these models can be quite challenging. There is a strong dependence on the quality of data and on finding the best instruction-tuning set. Further, the inherent limitations of training methods make it substantially difficult to train relatively small models with 7B and 13B parameters. In our research, we suggest an improved training method for these models that utilises knowledge from larger models, such as mixture-of-experts (8x7B) architectures. The scale of these larger models allows them to capture a wide range of variations from data alone, making them effective teachers for smaller models. Moreover, we implement a novel post-training domain alignment phase that employs domain-specific expert models to boost domain-specific knowledge during training while preserving the model's ability to generalise. Fine-tuning Mistral 7B and 2x7B with our method surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters, achieving up to $7.9$ on MT-Bench and $93.04\%$ on AlpacaEval.
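
To make the idea of a prediction-layer distillation loss concrete, the following is a minimal sketch for a single token position. It assumes the common temperature-softened KL-divergence formulation of knowledge distillation; the exact loss used in this work is not specified in this section, so the function names, the temperature value, and the toy logits are all illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_prediction_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: this is the standard softened-KL distillation loss, scaled
    by temperature**2 as is conventional; it is not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
student_close = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
student_far = [0.5, 1.0, 4.0]    # prefers a different token

loss_close = kd_prediction_loss(student_close, teacher)
loss_far = kd_prediction_loss(student_far, teacher)
```

In this sketch, a student whose distribution matches the teacher's incurs a loss near zero, and the loss grows as the student's predictions diverge. In practice the same quantity would be averaged over all token positions and combined with other terms, such as the attention-based loss mentioned in the TL;DR.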
