A Closer Look at the Limitations of Instruction Tuning
Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, Dinesh Manocha
TL;DR
This work critically examines whether Instruction Tuning (IT) adds knowledge or skills to large language models (LLMs). It systematically compares LoRA-based fine-tuning (LFT) with full-parameter fine-tuning (SFT) across multiple open-source LLMs and IT datasets, using human and GPT-4 multi-aspect evaluations together with token-distribution analyses. The key findings are that LFT largely preserves pre-trained knowledge and yields better factuality, learning mainly response initiation and style tokens, whereas SFT learns new content from the IT data at the cost of knowledge degradation and increased hallucination, the latter caused by inaccurately borrowing tokens from conceptually similar instances in the IT data. Copying response patterns from IT data generally lowers response quality, though simplifying IT responses can mitigate hallucinations. The study concludes that pre-trained knowledge remains the dominant factor and points to future work on mitigating SFT-induced hallucinations and on hybrid approaches that pair concise IT data with strong pre-trained grounding. Overall, the paper offers guidance for building robust open-domain chat models by showing where IT falls short and where it remains effective.
Abstract
Instruction Tuning (IT), the process of training large language models (LLMs) using instruction-response pairs, has emerged as the predominant method for transforming base pre-trained LLMs into open-domain conversational agents. While IT has achieved notable success and widespread adoption, its limitations and shortcomings remain underexplored. In this paper, through rigorous experiments and an in-depth analysis of the changes LLMs undergo through IT, we reveal various limitations of IT. In particular, we show that (1) IT fails to enhance knowledge or skills in LLMs. LoRA fine-tuning is limited to learning response initiation and style tokens, and full-parameter fine-tuning leads to knowledge degradation. (2) Copying response patterns from IT datasets derived from knowledgeable sources leads to a decline in response quality. (3) Full-parameter fine-tuning increases hallucination by inaccurately borrowing tokens from conceptually similar instances in the IT dataset for generating responses. (4) Popular methods to improve IT do not lead to performance improvements over a simple LoRA fine-tuned model. Our findings reveal that responses generated solely from pre-trained knowledge consistently outperform responses by models that learn any form of new knowledge from IT on open-source datasets. We hope the insights and challenges revealed in this paper inspire future work in related directions.
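For readers unfamiliar with the two fine-tuning regimes compared above, the following minimal sketch contrasts how many parameters each one trains for a single weight matrix. It is illustrative only and is not taken from the paper: the layer dimensions and the rank are hypothetical values, and the count for LoRA assumes the standard formulation in which a frozen weight matrix W of shape d_out x d_in receives a low-rank update B·A, with B of shape d_out x r and A of shape r x d_in.

```python
def full_ft_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
full = full_ft_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={rank}): {lora:,} trainable parameters ({lora / full:.2%} of full)")
```

Under these assumed sizes the adapter trains well under one percent of the parameters that full fine-tuning updates for the same layer, which gives a sense of how different the two regimes are in how much of the pre-trained model they can modify.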
