Exploring Training on Heterogeneous Data with Mixture of Low-rank Adapters

Yuhang Zhou, Zihua Zhao, Haolin Li, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang

TL;DR

This work tackles the challenge of training a unified model on heterogeneous data from diverse domains and tasks, where gradient conflicts can hinder learning. It introduces Mixture of Low-rank Adapters (MoLA), a framework that attaches multiple low-rank adapters to a shared backbone and combines them via task-aware or router-based mechanisms. The two variants, MoLA-Grad and MoLA-Router, provide explicit and implicit gradient separation, respectively, with MoLA-Router augmented by a Task-wise Decorrelation (TwD) loss to encourage task-discriminative adapter mixing. Across domain, multi-input-task, and single-input-task heterogeneity, MoLA demonstrates superior performance, improved parameter efficiency, and scalable training for large-scale heterogeneous data scenarios. The approach holds practical significance for applications in healthcare, computer vision, and multimodal modeling where diverse data sources must be leveraged effectively.

Abstract

Training a unified model to take multiple targets into account is a trend towards artificial general intelligence. However, how to efficiently mitigate the training conflicts among heterogeneous data collected from different domains or tasks remains under-explored. In this study, we explore leveraging Mixture of Low-rank Adapters (MoLA) to mitigate conflicts in heterogeneous data training, which requires jointly training the multiple low-rank adapters and their shared backbone. Specifically, we introduce two variants of MoLA, namely MoLA-Grad and MoLA-Router, to handle the target-aware and target-agnostic scenarios during inference, respectively. The former uses task identifiers to assign personalized low-rank adapters to each task, disentangling task-specific knowledge into the corresponding adapters and thereby mitigating heterogeneity conflicts. The latter uses a novel Task-wise Decorrelation (TwD) loss to guide the router to learn oriented weight combinations of adapters for homogeneous tasks, achieving similar effects. We conduct comprehensive experiments to verify the superiority of MoLA over previous state-of-the-art methods and present in-depth analysis of its working mechanism. Source code is available at: https://github.com/MediaBrain-SJTU/MoLA
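
The two inference modes described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The class name `MoLALinear`, the zero initialization of the up-projections, the use of a single linear layer, and the softmax linear router in the demo are all assumptions made for illustration. The TwD loss is not sketched because its formula is not given here.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class MoLALinear:
    """Hypothetical linear layer: shared weight W0 plus K low-rank adapters B_k @ A_k."""

    def __init__(self, d_in, d_out, num_adapters, rank, seed=0):
        rng = np.random.default_rng(seed)
        # Shared backbone weight, used by every task.
        self.W0 = rng.normal(0.0, 0.02, size=(d_out, d_in))
        # Down-projections A_k with shape (K, rank, d_in).
        self.A = rng.normal(0.0, 0.02, size=(num_adapters, rank, d_in))
        # Up-projections B_k with shape (K, d_out, rank); zero init is an assumption
        # borrowed from common low-rank adapter practice.
        self.B = np.zeros((num_adapters, d_out, rank))

    def adapter_outputs(self, x):
        """Return every adapter's output B_k A_k x with shape (batch, K, d_out)."""
        low = np.einsum("krd,bd->bkr", self.A, x)
        return np.einsum("kor,bkr->bko", self.B, low)

    def forward_task_aware(self, x, task_ids):
        """Target-aware mode: each sample uses the adapter chosen by its task id."""
        delta = self.adapter_outputs(x)[np.arange(x.shape[0]), task_ids]
        return x @ self.W0.T + delta

    def forward_routed(self, x, mix_weights):
        """Target-agnostic mode: adapters are mixed by router weights of shape (batch, K)."""
        delta = np.einsum("bk,bko->bo", mix_weights, self.adapter_outputs(x))
        return x @ self.W0.T + delta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    layer = MoLALinear(d_in=8, d_out=4, num_adapters=3, rank=2)
    x = rng.normal(size=(5, 8))
    task_ids = np.array([0, 1, 2, 0, 1])
    router_W = rng.normal(size=(3, 8))  # hypothetical linear router parameters
    mix = softmax(x @ router_W.T)       # one weight per adapter, rows sum to 1
    print(layer.forward_task_aware(x, task_ids).shape)
    print(layer.forward_routed(x, mix).shape)
```

In this sketch, passing one-hot mixing weights to `forward_routed` reproduces `forward_task_aware`, so the task-aware mode can be read as a special case of the routed one in which the task identifier fixes the mixture.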

Paper Structure (21 sections, 5 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: Three common types of heterogeneous data. Left: Domain heterogeneity. Each target can correspond to input data from multiple heterogeneous domains, as in multi-domain training. Middle: Multi-input task heterogeneity. Each target has its own input data, as in medical diagnosis. Right: Single-input task heterogeneity. Tasks can share the same input data, as in scene understanding.
  • Figure 2: Blue rectangles represent shared modules; orange and green rectangles represent task-specific modules. The small dashed rectangles represent low-rank adapters. $\oplus$ computes the weighted sum of the low-rank adapters in MoLA, weighted by the output of the shared router.
  • Figure 3: The proportion of principal component eigenvectors in the model's weight matrix. After using MoLA, the proportion significantly increases, indicating that more eigenvectors are utilized, which is beneficial for expressing task-specific directions.
  • Figure 4: Box plot of the eigenvalue distribution of the weights at different layers. The outlier points above the boxes correspond to relatively large eigenvalues, indicating that their corresponding eigenvectors play an important role in feature extraction. Top-left: the backbone parameters $W_0$. Top-right and bottom-left: combinations of different low-rank adapters with $W_0$. Bottom-right: the combination of a higher-rank low-rank adapter with $W_0$.
  • Figure 5: Comparison of the parameter counts of different methods, and the influence of the choice of rank r on the parameter count. The vertical axis shows "Params (M)". The black horizontal line represents the parameter count of the single-task model.
  • ...and 1 more figure
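
The dependence of parameter count on the rank r mentioned in the Figure 5 caption can be made concrete with a small arithmetic sketch. It assumes a single linear layer of hypothetical size 512 x 512 with three adapters, ignores any router parameters, and does not reproduce the paper's reported numbers. Each rank-r adapter adds r * (d_in + d_out) parameters on top of the shared weight.

```python
def full_weight_params(d_in, d_out):
    """Parameters of one dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def mola_params(d_in, d_out, num_adapters, rank):
    """Shared weight plus num_adapters low-rank pairs A (rank x d_in) and B (d_out x rank)."""
    return full_weight_params(d_in, d_out) + num_adapters * rank * (d_in + d_out)


def separate_models_params(d_in, d_out, num_tasks):
    """Baseline for comparison: one independent dense weight per task."""
    return num_tasks * full_weight_params(d_in, d_out)


if __name__ == "__main__":
    d_in, d_out, num_tasks = 512, 512, 3  # hypothetical sizes, not taken from the paper
    print("separate models:", separate_models_params(d_in, d_out, num_tasks))
    for rank in (4, 8, 16, 32):
        print("rank", rank, "->", mola_params(d_in, d_out, num_tasks, rank))
```

Under these assumptions the adapter overhead grows linearly in r, so the total stays close to the size of one dense layer as long as r is much smaller than d_in and d_out.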