CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions

Jun Rao, Xuebo Liu, Lian Lian, Shengjun Cheng, Yunjie Liao, Min Zhang

TL;DR

CommonIT introduces a data-partitioning approach to instruction tuning that leverages data commonality by grouping IT data along Task, Embedding, and Length and training with single-group mini-batches in a two-phase workflow. By enforcing batch coherence within groups while varying groups across batches, CommonIT improves instruction-following across general and domain-specific tasks and across diverse foundation models. The method is validated on multiple IT datasets and benchmarks, showing robust gains in general knowledge, reasoning, multilinguality, and coding tasks, with metric-specific gains depending on the grouping criterion. The work highlights the potential of data-centric training strategies beyond simple data mixing, offering practical guidance on grouping choices and demonstrating scalability and applicability, albeit with resource and theoretical analysis remaining as future work.

Abstract

With instruction tuning (IT), Large Language Models (LLMs) can enhance their ability to adhere to commands. Unlike most work, which focuses on data mixing, our study concentrates on enhancing the model's capabilities from the perspective of data sampling during training. Drawing inspiration from the human learning process, where it is generally easier to master solutions to similar topics through focused practice on a single type of topic, we introduce a novel instruction tuning strategy termed CommonIT: Commonality-aware Instruction Tuning. Specifically, we cluster instruction datasets into distinct groups using three proposed metrics (Task, Embedding, and Length). We ensure that each training mini-batch, or "partition", consists solely of data from a single group, which yields both data randomness across mini-batches and data similarity within each mini-batch. Rigorous testing across IT datasets (FLAN, CoT, and Alpaca) and models (LLaMa2-7B, Qwen2-7B, LLaMa 13B, and BLOOM 7B) demonstrates CommonIT's effectiveness in enhancing the instruction-following capabilities of LLMs. CommonIT yields an average improvement of 2.1% on the general domain (i.e., the average score of Knowledge, Reasoning, Multilinguality, and Coding) with the Length metric, 5.2% on the special domain (i.e., GSM, Openfunctions, and Code) with the Task metric, and 3.8% on specific tasks (i.e., MMLU) with the Embedding metric. Code is available at https://github.com/raojay7/CommonIT.
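The single-group mini-batch idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `single_group_batches` and `length_bucket`, the `instruction` and `task` fields, and the bucket edges (64 and 256 whitespace tokens) are all hypothetical choices made for illustration. The sketch groups examples by a chosen key, cuts each group into mini-batches, and then shuffles the batches so that consecutive training steps draw from different groups.

```python
import random
from collections import defaultdict


def length_bucket(example, edges=(64, 256)):
    """Hypothetical Length-style key: bucket by whitespace token count.

    The bucket edges are illustrative assumptions, not values from the paper.
    """
    n_tokens = len(example["instruction"].split())
    for bucket, edge in enumerate(edges):
        if n_tokens <= edge:
            return bucket
    return len(edges)


def single_group_batches(examples, group_fn, batch_size, seed=0):
    """Build mini-batches in which every example shares one group key."""
    rng = random.Random(seed)

    # Step 1: partition the dataset by the grouping key.
    groups = defaultdict(list)
    for example in examples:
        groups[group_fn(example)].append(example)

    # Step 2: shuffle inside each group and cut it into mini-batches.
    # The last batch of a group may be smaller than batch_size.
    batches = []
    for members in groups.values():
        rng.shuffle(members)
        for start in range(0, len(members), batch_size):
            batches.append(members[start:start + batch_size])

    # Step 3: shuffle the batch order so groups alternate across steps.
    rng.shuffle(batches)
    return batches
```

Passing `group_fn=lambda ex: ex["task"]` would correspond to a Task-style grouping, while `group_fn=length_bucket` corresponds to a Length-style grouping; an Embedding-style grouping would need cluster labels computed beforehand, which this sketch does not cover.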

Paper Structure (38 sections, 2 equations, 12 figures, 15 tables, 1 algorithm)

This paper contains 38 sections, 2 equations, 12 figures, 15 tables, 1 algorithm.

Figures (12)

  • Figure 1: Because instructions are mixed, the LLM does not reliably identify the specific task behind each instruction after IT (Shi et al., 2023). In this case, it fails to recognize the translation instruction, "Please translate the following sentence ...", and simply replies with the final phrase "Can I change it?".
  • Figure 2: An overview of the baseline (IT) and our CommonIT. The different shapes and colors in the figure indicate properties of the data that can be used for grouping (task, statistics, and embedding); we use shapes as the example here. During training, the CommonIT strategy feeds the model data from a single group at a time, e.g., batch $t$ from class (a). The model calculates the loss ${L}_t(\boldsymbol{\theta})$ on this partitioned data to update its parameters.
  • Figure 3: t-SNE plots for MMLU with 10 question types across four disciplines (Humanities, Social, STEM, and Other). Clusters are tighter and more distinguishable with our proposed CommonIT (Embedding), indicating that CommonIT can better differentiate the discipline type of a question.
  • Figure 4: The figure on the left shows how the training loss varies with the training step (Epoch 2). The bar chart on the right compares our results with the baseline on 0-shot MMLU at different training epochs. This indicates that CommonIT achieves a lower final language modeling loss than the baseline and that extending the number of training epochs further improves performance.
  • Figure 5: The FLAN dataset grouped by task.
  • ...and 7 more figures