Ensembles of Low-Rank Expert Adapters

Yinghao Li, Vianne Gao, Chao Zhang, MohamadAli Torkamani

TL;DR

This work tackles the problem of conflicting gradient directions when fine-tuning large language models on diverse data. It introduces ELREA, an ensemble of low-rank adapters built in three steps: a base LoRA adapter is first trained on the full data; the data points are then clustered by gradient direction, using gradient features randomly projected to dimension $d_{proj}$; finally, one LoRA expert is trained per cluster. At inference, predictions from the base adapter and the cluster experts are weighted by the input's gradient similarity to the cluster centroids. The weights are computed from standardized cosine similarities, i.e. $w_c = \mathrm{softmax}(\cos'(\boldsymbol{\delta}_{test}, \bar{\boldsymbol{\delta}}_c))$, and the next-token logits are combined accordingly. Empirically, ELREA outperforms full-dataset LoRA baselines and other ensemble methods across domain-specific tasks, with ablations confirming the effectiveness of gradient-based clustering and routing, though at higher inference cost relative to single-model baselines.
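The routing rule above can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names `routing_weights` and `combine_logits` are hypothetical, "standardized" is assumed here to mean a z-score across clusters, and the sketch mixes only the cluster experts because this summary does not spell out how the base adapter enters the mixture.

```python
import numpy as np

def routing_weights(delta_test, centroids, eps=1e-8):
    """Softmax over z-scored cosine similarities to the cluster centroids.

    delta_test: (d,) gradient feature of the test input.
    centroids:  (C, d) mean gradient feature of each training cluster.
    """
    delta_test = np.asarray(delta_test, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Cosine similarity between the test feature and every centroid.
    num = centroids @ delta_test
    den = np.linalg.norm(centroids, axis=1) * np.linalg.norm(delta_test) + eps
    cos = num / den
    # Assumption: standardization is a z-score taken across the C clusters.
    cos_std = (cos - cos.mean()) / (cos.std() + eps)
    # Numerically stable softmax.
    z = np.exp(cos_std - cos_std.max())
    return z / z.sum()

def combine_logits(expert_logits, weights):
    """Weighted sum of per-expert next-token logits: (C, V) -> (V,)."""
    return np.tensordot(weights, np.asarray(expert_logits, dtype=float), axes=1)
```

A typical call would compute `w = routing_weights(delta_test, centroids)` once per input and then apply `combine_logits` at each decoding step, which is where the extra inference cost relative to a single adapter comes from.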

Abstract

The training and fine-tuning of large language models (LLMs) often involve diverse textual data from multiple sources, which poses challenges due to conflicting gradient directions, hindering optimization and specialization. These challenges can undermine model generalization across tasks, resulting in reduced downstream performance. Recent research suggests that fine-tuning LLMs on carefully selected, task-specific subsets of data can match or even surpass the performance of using the entire dataset. Building on these insights, we propose the Ensembles of Low-Rank Expert Adapters (ELREA) framework to improve the model's capability to handle diverse tasks. ELREA clusters the training instructions based on their gradient directions, representing different areas of expertise and thereby reducing conflicts during optimization. Expert adapters are then trained on these clusters, utilizing the low-rank adaptation (LoRA) technique to ensure training efficiency and model scalability. During inference, ELREA combines predictions from the most relevant expert adapters based on the input data's gradient similarity to the training clusters, ensuring optimal adapter selection for each task. Experiments show that our method outperforms baseline LoRA adapters trained on the full dataset and other ensemble approaches with similar training and inference complexity across a range of domain-specific tasks.
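As a rough illustration of clustering by gradient direction, the sketch below projects per-example gradient features to a lower dimension and groups them by cosine similarity. The abstract does not name the clustering algorithm, so the spherical k-means loop with farthest-point initialization is a stand-in chosen for brevity, and `project_gradients` and `cluster_by_direction` are hypothetical names rather than anything from the paper.

```python
import numpy as np

def project_gradients(grads, d_proj, seed=0):
    """Random-project (N, D) gradient features to (N, d_proj) unit vectors."""
    grads = np.asarray(grads, dtype=float)
    rng = np.random.default_rng(seed)
    # Gaussian random projection, scaled to roughly preserve norms.
    proj = rng.normal(size=(grads.shape[1], d_proj)) / np.sqrt(d_proj)
    feats = grads @ proj
    # Keep only the direction of each projected gradient.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)

def cluster_by_direction(feats, n_clusters, n_iters=20):
    """Stand-in spherical k-means on unit vectors (cosine-based assignment)."""
    # Farthest-point initialization: start from row 0, then repeatedly add
    # the row least similar to the centroids chosen so far.
    chosen = [0]
    for _ in range(1, n_clusters):
        sims = (feats @ feats[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(sims)))
    centroids = feats[chosen].copy()
    for _ in range(n_iters):
        # Assign each point to the centroid with the highest cosine similarity.
        labels = np.argmax(feats @ centroids.T, axis=1)
        for c in range(n_clusters):
            members = feats[labels == c]
            if len(members) > 0:
                mean = members.mean(axis=0)
                centroids[c] = mean / max(np.linalg.norm(mean), 1e-12)
    labels = np.argmax(feats @ centroids.T, axis=1)
    return labels, centroids
```

Under these assumptions, each resulting label set would define the training subset for one expert adapter, and the returned centroids would play the role of the cluster centroids used for routing at inference.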

Paper Structure

This paper contains 31 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The pipeline of ELREA for fine-tuning and inference. The data points (solid and hollow circles) do not necessarily have a geometric correspondence to their gradient directions (arrows).
  • Figure 2: Average weight distribution across clusters for different datasets and LoRA ranks. Only relative values matter. "M-C" represents MATH-Combined.
  • Figure 3: Effects of gradient projection dimensionality and selection of top-$k$ experts during inference on model performance.
  • Figure 4: Distribution of data sources and categories within each cluster for the MATH-Combined and GLUR (general language understanding and reasoning) training sets at rank $r=8$. Cluster indices are shown along the rows, while columns represent data sources and categories, formatted as "{source dataset}-{category}" for MATH-Combined and "{source dataset}" for GLUR. The color intensity reflects the sample count, with darker shades indicating higher counts. Each column is independently normalized, meaning scales may differ across columns. Color gradients are slightly curved to improve visibility for categories with fewer samples.
  • Figure 5: Examples of data clusters from MATH-Combined, generated using different random seeds in cases where the clusters are non-identical. The entire dataset is used for clustering, but only $10\%$ of the data is visualized for clarity. The 8192-dimensional gradient features are projected into 2D space using t-SNE. The colors are randomly assigned; the same color does not necessarily imply the same cluster across different seeds.