Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data

Juncheng Xie, Shensian Syu, Hung-yi Lee

TL;DR

This work shows that instruction-following capabilities can emerge from non-instructional data: a teacher LLM's text continuations are distilled into a pre-trained model through LoRA fine-tuning. The training data pairs the first half of an OpenWebText passage (the prompt) with a continuation written by an OpenAI model; Claude-derived data is also used. The authors report consistent improvements across MT-Bench, the Open LLM Leaderboard, and Arena Hard, with some configurations matching or surpassing instruction-tuned baselines (e.g., an Arena Hard score of 57.0 for the fine-tuned Meta-Llama-3-70b-Instruct). The results challenge the assumption that explicit instruction data is necessary and point to a scalable path to alignment and instruction following that relies on high-quality teacher models. The authors also compare against MAGPIE and Alpaca baselines to contextualize data efficiency, and discuss limitations and future directions, including understanding the underlying mechanism and testing real-world generalization.

Abstract

Instruction fine-tuning is crucial for today's large language models (LLMs) to learn to follow instructions and align with human preferences. Conventionally, supervised data, including the instruction and the correct response, is required for instruction fine-tuning. To obtain such data, some researchers prompted well-trained models like GPT-4 to generate instructions and correct responses. In this paper, we propose a novel approach that uses the first half of a random text from OpenWebText as the instruction and GPT-3.5-turbo or GPT-4-turbo to complete the text as the response. Despite the data being "non-instructional", we found that pre-trained LLMs fine-tuned on this data can gain instruction-following capabilities. This observation is verified by fine-tuning several well-known pre-trained LLMs (e.g., LLaMA-2-7B, LLaMA-3-8B, LLaMA-3-70B, Mistral-7B-v0.1). The "non-instructional data" also improved some models that underwent supervised fine-tuning and human preference alignment. Our LLaMA-3-70B-Instruct fine-tuned through "non-instructional data" is comparable with LLaMA-3.1-70B-Instruct on the Arena Hard leaderboard. We analyzed the "non-instructional data" and ensured it is devoid of content related to instruction fine-tuning. Our findings will inspire further investigation into how to develop instruction-following capabilities without explicit instruction-related data.
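
The data construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the record keys (`instruction`, `response`), and the choice to measure "half" in whitespace-separated words are all assumptions, and the teacher model is represented by a hypothetical `complete` callable so that no particular API is implied.

```python
from typing import Callable, Dict, Iterable, List


def split_first_half(text: str) -> str:
    """Return the first half of `text`.

    Assumption: "half" is measured in whitespace-separated words;
    the abstract does not say how the split point is chosen.
    """
    words = text.split()
    return " ".join(words[: len(words) // 2])


def build_examples(
    texts: Iterable[str],
    complete: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each half-text prompt with a teacher-written continuation.

    `complete` is a hypothetical stand-in for a call to the teacher
    model (e.g., GPT-3.5-turbo or GPT-4-turbo in the paper).
    """
    examples = []
    for text in texts:
        prompt = split_first_half(text)
        if not prompt:
            # Skip passages too short to split into a non-empty prompt.
            continue
        examples.append({"instruction": prompt, "response": complete(prompt)})
    return examples


if __name__ == "__main__":
    # Toy teacher that just upper-cases the prompt, for demonstration only.
    demo = build_examples(["the quick brown fox jumps over"], lambda p: p.upper())
    print(demo)
```

The resulting records have the same shape as ordinary instruction-tuning pairs, which is what allows a standard supervised fine-tuning pipeline to consume them even though the prompts contain no instructions.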

Paper Structure (37 sections, 2 figures, 15 tables)

This paper contains 37 sections, 2 figures, 15 tables.

Figures (2)

  • Figure 1: Our framework for distillation involves using a specific dataset to prompt ChatGPT for continued writing, simulating a targeted context.
  • Figure 2: Data size vs. MT-Bench score