DEM: Distribution Edited Model for Training with Mixed Data Distributions

Dhananjay Ram, Aditya Rawal, Momchil Hardalov, Nikolaos Pappas, Sheng Zha

TL;DR

This paper proposes a simple and efficient alternative for better optimization of the data sources by combining models individually trained on each data source with the base model using basic element-wise vector operations, which is cheaper than standard data mixing and outperforms strong baselines on a variety of benchmarks.

Abstract

Training with mixed data distributions is a common and important part of creating multi-task and instruction-following models. The diversity of the data distributions and the cost of joint training make the optimization procedure extremely challenging. Data mixing methods partially address this problem, but they perform sub-optimally across data sources and require multiple expensive training runs. In this paper, we propose a simple and efficient alternative for better optimization of the data sources by combining models individually trained on each data source with the base model using basic element-wise vector operations. The resulting model, namely Distribution Edited Model (DEM), is 11x cheaper than standard data mixing and outperforms strong baselines on a variety of benchmarks, yielding up to 6.2% improvement on MMLU, 11.5% on BBH, 16.1% on DROP, 6% on MathQA, and 9.3% on HELM with models of size 3B to 13B. Notably, DEM does not require full re-training when modifying a single data source, making it flexible and scalable for training with diverse data sources.

Paper Structure (26 sections, 4 equations, 3 figures, 12 tables)

Figures (3)

  • Figure 1: The Distribution Edited Model ($\Theta_D$) results from fine-tuning a pretrained model ($\Theta$) on $n$ individual data distributions ($D_i$) and combining the resulting models with basic element-wise vector operations. Here, the combination is achieved by extracting distribution vectors ($\Delta \Theta_{D_i}$), multiplying them by weight coefficients ($\omega_i$), and adding their weighted sum to the base model.
  • Figure 2: t-SNE representation of the fine-tuning datasets. The centroids of the datasets are marked as larger points with captions.
  • Figure 3: Layer-wise Euclidean distance between the base OpenLLaMA model and the tuned models. Darker colors indicate a larger absolute difference. The Euclidean distance values are normalized per model by that model's highest layer distance, so the plots are invariant to the scale of the weight change.
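
The combination described in the Figure 1 caption and the normalization described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each model is a plain dictionary mapping a layer name to a NumPy array, and the function names (`distribution_vectors`, `combine_dem`, `normalized_layer_distances`) are hypothetical names chosen for illustration.

```python
import numpy as np


def distribution_vectors(base, tuned_models):
    """One distribution vector per fine-tuned model: tuned minus base, per layer."""
    return [{name: tuned[name] - base[name] for name in base} for tuned in tuned_models]


def combine_dem(base, tuned_models, weights):
    """Add the weighted sum of distribution vectors to the base parameters.

    Assumed layout (not from the paper): every model is a dict of
    layer name -> NumPy array, and all models share the same keys and shapes.
    """
    if len(tuned_models) != len(weights):
        raise ValueError("need exactly one weight per fine-tuned model")
    deltas = distribution_vectors(base, tuned_models)
    merged = {}
    for name, value in base.items():
        # Element-wise: Theta + sum_i omega_i * (Theta_{D_i} - Theta)
        merged[name] = value + sum(w * d[name] for w, d in zip(weights, deltas))
    return merged


def normalized_layer_distances(base, tuned):
    """Per-layer Euclidean distance, divided by this model's largest layer distance."""
    dists = {name: float(np.linalg.norm(tuned[name] - base[name])) for name in base}
    top = max(dists.values())
    return {name: (d / top if top > 0 else 0.0) for name, d in dists.items()}


# Toy example with two "layers" and two fine-tuned models.
base = {"layer0": np.array([0.0, 0.0]), "layer1": np.array([1.0, 1.0])}
tuned_a = {"layer0": np.array([2.0, 0.0]), "layer1": np.array([1.0, 1.0])}
tuned_b = {"layer0": np.array([0.0, 0.0]), "layer1": np.array([1.0, 5.0])}

merged = combine_dem(base, [tuned_a, tuned_b], weights=[0.5, 0.25])
distances_a = normalized_layer_distances(base, tuned_a)
```

Because each data source contributes its own vector, changing one source in this sketch only means recomputing that source's entry and re-running `combine_dem`, which mirrors the flexibility the abstract claims. The per-model division in `normalized_layer_distances` is what makes the result independent of the overall scale of the weight change.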