Mixture-of-Instructions: Aligning Large Language Models via Mixture Prompting

Bowen Xu, Shaoyu Wu, Kai Liu, Lulu Hu

TL;DR

This work tackles cross-task alignment of large language models by showing that system prompts can cause overfitting and guide performance in unintended directions. It proposes Mixture of Instructions (MoI), a multi-component training framework that combines domain-specific prompts, balanced packing of instructions, and chunk-based attention masking to enable effective multi-task supervised fine-tuning. The MoI approach is implemented on the Qwen-7B-chat base to produce Qwen-SFT-MoI, which demonstrates improved capabilities in mathematics, coding, tool use, and conversation across seven benchmarks, while reducing dataset bias. The results highlight MoI's potential to provide scalable, transfer-friendly alignment of LLMs across diverse domains without sacrificing existing conversational competencies.

Abstract

With the proliferation of large language models (LLMs), the comprehensive alignment of such models across multiple tasks has emerged as a critical area of research. Existing alignment methodologies primarily address a single task, such as multi-turn dialogue, coding, mathematical problem-solving, or tool usage. Although a large amount of high-quality data is available for these tasks, most of it provides only questions and answers without a system prompt. Through a detailed analysis of the Qwen language model, we found that the system prompt has a significant impact on both the training and inference processes of an LLM. We attribute this phenomenon to overfitting to the system prompt. To address this issue, we introduce a novel technique termed Mixture-of-Instructions (MoI), which employs a strategy of instruction packing combined with diverse system prompts to boost the alignment efficiency of language models. We have also compiled a diverse set of seven benchmark datasets to rigorously evaluate the alignment efficacy of the MoI-enhanced language model. Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI. This enhanced model demonstrates significant advancements in generative capabilities across coding, mathematics, and tool use tasks.

Paper Structure (28 sections, 6 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: Performance of Qwen-SFT-MoI and Qwen-7B-chat models, along with various SFT-aligned models on subdomain datasets, evaluated across seven datasets encompassing mathematics, programming, tool usage, common sense, and both single and multi-turn dialogues. Results demonstrate that training with our MoI method enhances multiple capabilities of language models, achieving improved alignment.
  • Figure 2: In the scenarios depicted, a system prompt was configured for the models. In the first case, the model effectively responded to questions aligned with the system prompt settings.
  • Figure 3: Attention maps for responses to Question ID 127 in the MT-Bench, which show attention distribution across the system prompt, the question, and each model's answer. (a) Qwen-7B-chat focuses heavily on the prompt and question but incorrectly associates 'Boyer' with frequency tracking, misrepresenting the Boyer-Moore algorithm. (b) After SFT on code generation data, the model still overemphasizes the prompt and question, overlooks 'Boyer' and fixates on iterative element search. (c) Post-SFT with a new system prompt, the model shifts attention towards generating an answer, correctly hinting at the Boyer-Moore algorithm, which fundamentally tracks a candidate element.
  • Figure 4: (a) Sequence of Instructions involves extracting instructions from a dataset, tokenizing them, and using padding tokens to reach a fixed maximum length for SFT in an LLM. (b) Packed Instructions merges several instructions into a single longer instruction, minimizing the need for padding and enhancing training efficiency. (c) Balanced Packed Instructions employs balanced sampling from various datasets and concatenates instructions to meet a maximum token length. (d) Mixture of Instructions then prioritizes the instruction with the default system prompt at the start.
  • Figure 5: The numbers represent the simplified position IDs, which are used to generate the masked-out position embedding. Comparison of different attention masks on the same data: (a) default attention mask for sequence instruction concatenation, (b) default attention mask for balanced sampling concatenation, (c) specially designed mutually isolated attention mask, and (d) our chunk-based attention mask.
  • ...and 4 more figures
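
The packing steps described in the Figure 4 caption, panels (b) to (d), can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`token_len`, `pack_one`, `mix_instructions`), the whitespace stand-in for a tokenizer, and the round-robin reading of "balanced sampling" are all assumptions made for illustration. The sketch draws one instruction at a time from each dataset in turn, always visiting the domain that carries the default system prompt first, and stops filling a pack when no remaining instruction fits within the token budget.

```python
import random


def token_len(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def pack_one(pools, order, max_tokens):
    """Fill one pack by cycling over the datasets in `order` until nothing else fits.

    Returns the pack as a list of (domain, text) pairs plus the number of
    padding tokens still needed to reach `max_tokens`.
    """
    pack, used = [], 0
    progress = True
    while progress:
        progress = False
        for name in order:
            pool = pools[name]
            if pool and used + token_len(pool[-1]) <= max_tokens:
                text = pool.pop()
                pack.append((name, text))
                used += token_len(text)
                progress = True
    return pack, max_tokens - used


def mix_instructions(datasets, default_domain, max_tokens, seed=0):
    """Pack instructions from several datasets (hypothetical helper).

    `datasets` maps a domain name to a list of instruction strings and must
    contain `default_domain`, the domain assumed to use the default system
    prompt. Instructions longer than `max_tokens` are dropped in this sketch.
    """
    rng = random.Random(seed)
    pools = {}
    for name, items in datasets.items():
        fitting = [t for t in items if token_len(t) <= max_tokens]
        pools[name] = rng.sample(fitting, len(fitting))  # shuffled copy
    # Visit the default-prompt domain first so it leads each pack when available.
    order = [default_domain] + [n for n in pools if n != default_domain]
    packs = []
    while any(pools.values()):
        packs.append(pack_one(pools, order, max_tokens))
    return packs


demo = {
    "chat": ["hello there friend", "how are you"],
    "code": ["write a sort function", "fix bug"],
    "math": ["add two numbers"],
}
for pack, padding in mix_instructions(demo, "chat", max_tokens=8):
    print([domain for domain, _ in pack], "padding:", padding)
```

A real pipeline would replace `token_len` with the model's tokenizer and would also need an attention mask and position IDs for the packed sequence, as discussed in the Figure 5 caption; those are left out here because the caption does not specify them in enough detail.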