OrchMoE: Efficient Multi-Adapter Learning with Task-Skill Synergy

Haowen Wang; Tao Sun; Kaixiang Ji; Jian Wang; Cong Fan; Jinjie Gu

TL;DR

OrchMoE addresses efficient multi-task learning for large language models with a dual-router, mixture-of-experts-style PEFT framework that automatically infers task categories and allocates specialized skills. At its core are a Task Router and a Skill Router, which drive a Task-Skill Allocation Matrix that dynamically pairs inputs with abstract tasks and adapter skills; the skills are implemented as low-rank LoRA modules within a unified architecture. On the SuperNI dataset, OrchMoE outperforms strong PEFT baselines across model scales (e.g., T5-XXL, GLM-10B) and task regimes, including unseen tasks, at matched parameter counts and with better transfer efficiency. Because it relies on abstract task notions and soft skill allocations rather than explicit task identifiers, the method supports scalable, data-efficient multi-task adaptation of large models.

Abstract

We advance the field of Parameter-Efficient Fine-Tuning (PEFT) with our novel multi-adapter method, OrchMoE, which capitalizes on modular skill architecture for enhanced forward transfer in neural networks. Unlike prior models that depend on explicit task identification inputs, OrchMoE automatically discerns task categories, streamlining the learning process. This is achieved through an integrated mechanism comprising an Automatic Task Classification module and a Task-Skill Allocation module, which collectively deduce task-specific classifications and tailor skill allocation matrices. Our extensive evaluations on the 'Super Natural Instructions' dataset, featuring 1,600 diverse instructional tasks, indicate that OrchMoE substantially outperforms comparable multi-adapter baselines in terms of both performance and sample utilization efficiency, all while operating within the same parameter constraints. These findings suggest that OrchMoE offers a significant leap forward in multi-task learning efficiency.
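
As a rough illustration of the dual-router idea described above, the following dependency-free sketch mixes several LoRA "skills" using a soft task assignment and a task-skill allocation matrix. It is a minimal sketch under stated assumptions, not the paper's implementation: the function and argument names (`dual_router_forward`, `task_proj`, `alloc_logits`), the linear task router, and the softmax normalization of both routers are all hypothetical choices made for this example.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def vecmat(v, M):
    """Row vector (length n) times matrix (n x m) -> list of length m."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def dual_router_forward(x, W0, task_proj, alloc_logits, lora_A, lora_B):
    """Hypothetical dual-router LoRA layer for a single input vector x.

    x            : input features, length d_in
    W0           : frozen base weight, d_in x d_out
    task_proj    : task-router weights, d_in x n_tasks (assumed linear router)
    alloc_logits : task-skill allocation logits, n_tasks x n_skills
    lora_A       : per-skill down-projections, each d_in x r
    lora_B       : per-skill up-projections, each r x d_out
    """
    # Task router: soft assignment of the input to abstract tasks (assumption: softmax).
    task_probs = softmax(vecmat(x, task_proj))
    # Skill router: each abstract task distributes weight over skills (assumption: row softmax).
    alloc = [softmax(row) for row in alloc_logits]
    # Per-input skill weights = task distribution times allocation matrix.
    skill_w = vecmat(task_probs, alloc)
    # Frozen base output plus the weighted sum of low-rank skill updates.
    out = vecmat(x, W0)
    for w, A, B in zip(skill_w, lora_A, lora_B):
        delta = vecmat(vecmat(x, A), B)
        out = [o + w * d for o, d in zip(out, delta)]
    return out, skill_w

# Toy usage: 2 input dims, 3 abstract tasks, 2 rank-1 skills.
x = [1.0, 2.0]
W0 = [[1.0, 0.0], [0.0, 1.0]]
task_proj = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
alloc_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
lora_A = [[[1.0], [0.0]], [[0.0], [1.0]]]
lora_B = [[[1.0, 1.0]], [[1.0, 1.0]]]
y, skill_w = dual_router_forward(x, W0, task_proj, alloc_logits, lora_A, lora_B)
```

Because the skill weights are derived from the input rather than from a task identifier, no explicit task label is needed at inference time; only the routers and the LoRA factors would be trainable in such a setup.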

Paper Structure (19 sections, 7 equations, 3 figures, 4 tables)

Introduction
Preliminaries
Methodology
Task Router
Skill Router
Parameter Efficiency
Empirical Experiments
Experimental Setup
Training Details
Main Results and Discussion
Comparative Analysis of Model Performance Across Task Scales
Transfer Learning Analysis on Unseen Tasks
Comparative Performance Analysis of $\texttt{OrchMoE}$ Across Model Scales
Parameter Efficiency Analysis
In-Depth Analysis of Learned Task and Skills
...and 4 more sections

Figures (3)

Figure 1: $\texttt{OrchMoE}$ Infrastructure
Figure 2: RougeLsum of PEFT methods on the SuperNI 100-Tasks dataset when applied to T5-XXL. The X-axis shows the number of trainable parameters during fine-tuning.
Figure 3: Task clustering dendrogram for the task-skill allocation matrix $W$ of $\texttt{OrchMoE}$, using GLM-10B as the base model with 100 abstract tasks in the NI-100-Tasks experiment. Tasks are grouped into the same category if they share a similar subset of skills.
