Exploring Training on Heterogeneous Data with Mixture of Low-rank Adapters
Yuhang Zhou, Zihua Zhao, Haolin Li, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
TL;DR
This work tackles the challenge of training a unified model on heterogeneous data from diverse domains and tasks, where gradient conflicts can hinder learning. It introduces Mixture of Low-rank Adapters (MoLA), a framework that attaches multiple low-rank adapters to a shared backbone and combines them via task-aware or router-based mechanisms. The two variants, MoLA-Grad and MoLA-Router, provide explicit and implicit gradient separation, respectively, with MoLA-Router augmented by a Task-wise Decorrelation (TwD) loss that encourages task-discriminative adapter mixing. Across three types of heterogeneity (domain, multi-input-task, and single-input-task), MoLA demonstrates superior performance, improved parameter efficiency, and scalable training on large-scale heterogeneous data. The approach is practically relevant to healthcare, computer vision, and multimodal modeling, where diverse data sources must be leveraged effectively.
Abstract
Training a unified model to take multiple targets into account is a trend towards artificial general intelligence. However, how to efficiently mitigate the training conflicts among heterogeneous data collected from different domains or tasks remains under-explored. In this study, we explore leveraging Mixture of Low-rank Adapters (MoLA) to mitigate conflicts in heterogeneous data training, which requires jointly training multiple low-rank adapters and their shared backbone. Specifically, we introduce two variants of MoLA, namely MoLA-Grad and MoLA-Router, to handle the target-aware and target-agnostic scenarios during inference, respectively. The former uses task identifiers to assign personalized low-rank adapters to each task, disentangling task-specific knowledge into the corresponding adapters and thereby mitigating heterogeneity conflicts. The latter uses a novel Task-wise Decorrelation (TwD) loss to steer the router toward learning adapter weight combinations oriented to homogeneous tasks, achieving a similar effect. We conduct comprehensive experiments to verify the superiority of MoLA over previous state-of-the-art methods and present an in-depth analysis of its working mechanism. Source code is available at: https://github.com/MediaBrain-SJTU/MoLA
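To make the two mixing modes concrete, the following is a minimal NumPy sketch of a single linear layer with a shared weight and several low-rank adapters. It is an illustration under stated assumptions, not the authors' implementation: the layer shape, the names `mola_forward`, `task_aware_weights`, `router_weights`, and `W_router`, the one-adapter-per-task one-hot assignment, and the linear softmax router are all hypothetical choices. The TwD loss is omitted because its form is not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
d_in, d_out, rank, n_adapters = 8, 6, 2, 3

# Shared backbone weight (here a single linear layer).
W = rng.normal(size=(d_out, d_in))
# One low-rank pair (B_i, A_i) per adapter; B_i @ A_i has rank <= `rank`.
A = rng.normal(size=(n_adapters, rank, d_in))
B = rng.normal(size=(n_adapters, d_out, rank))
# Assumed linear router, used only in the router-style variant.
W_router = rng.normal(size=(n_adapters, d_in))


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mola_forward(x, alpha):
    """Shared output plus an alpha-weighted sum of low-rank adapter outputs.

    x: (batch, d_in); alpha: (batch, n_adapters) mixing weights.
    """
    base = x @ W.T                                # (batch, d_out)
    low = np.einsum("nrd,bd->bnr", A, x)          # project down: (batch, n, rank)
    delta = np.einsum("nor,bnr->bno", B, low)     # project up:   (batch, n, d_out)
    return base + np.einsum("bn,bno->bo", alpha, delta)


def task_aware_weights(task_ids):
    """Task-aware mixing (assumption: one adapter per task, one-hot weights)."""
    return np.eye(n_adapters)[task_ids]


def router_weights(x):
    """Task-agnostic mixing (assumption: softmax over a linear router)."""
    return softmax(x @ W_router.T)


x = rng.normal(size=(4, d_in))
task_ids = np.array([0, 2, 1, 0])

y_task_aware = mola_forward(x, task_aware_weights(task_ids))
y_routed = mola_forward(x, router_weights(x))
```

With one-hot weights, each sample only passes through its own task's adapter, so in training only that adapter (and the shared weight) would receive gradient from that sample. With router weights, every adapter contributes and the separation between tasks has to be learned.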
