MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation

Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu

TL;DR

MoORE introduces a principled SVD-based model MoE-ization that converts a pre-trained weight matrix into a complete mixture of orthogonal rank-one experts, enabling conflict- and oblivion-resistant multi-task adaptation. By decomposing $W$ as $W = U \operatorname{diag}(\boldsymbol{\sigma}) V^{\top}$ and treating each rank-one term $\mathbf{u}_d \mathbf{v}_d^{\top}$ as an expert, MoORE adds a learnable router combining task- and sample-level cues and couples it with a learnable orthogonal adapter (Householder-based) to boost capacity while preserving the original column space $\text{Range}(W)$. This design yields orthogonal, non-redundant experts and maintains pre-training capabilities, reducing interference across tasks and forgetting of prior tasks. Experiments on CSR-MTL, NLU-MTL, and OR-MTL show that MoORE improves conflict- and oblivion-resistance and achieves competitive inference efficiency versus baselines like LoRA- and MixLoRA-based MoEs. Overall, MoORE provides a scalable, intrinsic MoE formulation for multi-task adaptation with strong empirical gains and practical efficiency.
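The decomposition described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the matrix size and the random stand-in for $W$ are assumptions made for illustration. It builds the rank-one terms $\mathbf{u}_d \mathbf{v}_d^{\top}$ from the SVD, confirms that weighting them by $\boldsymbol{\sigma}$ recovers $W$, and confirms that the experts are mutually orthogonal under the Frobenius inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: a small random matrix stands in for a pre-trained weight W.
W = rng.standard_normal((6, 4))

# Thin SVD: W = U diag(sigma) V^T
U, sigma, Vt = np.linalg.svd(W, full_matrices=False)

# Each expert is the rank-one outer product u_d v_d^T.
experts = [np.outer(U[:, d], Vt[d, :]) for d in range(len(sigma))]

# Weighting the experts by the singular values recovers W.
W_rebuilt = sum(s * E for s, E in zip(sigma, experts))

# Gram matrix of Frobenius inner products <E_i, E_j> = sum(E_i * E_j).
# It equals (u_i . u_j)(v_i . v_j), so it should be the identity.
gram = np.array([[np.sum(Ei * Ej) for Ej in experts] for Ei in experts])

print(np.allclose(W_rebuilt, W), np.allclose(gram, np.eye(len(sigma))))
```

Because the left and right singular vectors are each orthonormal, no two experts overlap, which is the "non-redundant" property the summary refers to.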

Abstract

Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel "model MoE-ization" strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at https://github.com/DaShenZi721/MoORE.
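The routing and column-space claims in the abstract can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions, not the paper's exact formulation: the additive gate on the singular values, the linear sample-level router `A_sample`, the per-task offsets `task_emb`, and the single Householder reflection `H` are all hypothetical choices made here for illustration. A rank-deficient stand-in for $W$ is used so that $\text{Range}(W)$ is a proper subspace and the check is non-trivial.

```python
import numpy as np

rng = np.random.default_rng(1)
D_out, D_in, r = 8, 5, 3

# Assumption: a rank-3 random matrix stands in for a pre-trained weight.
W = rng.standard_normal((D_out, r)) @ rng.standard_normal((r, D_in))
U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
U, sigma, Vt = U[:, :r], sigma[:r], Vt[:r, :]  # keep the nonzero singular triplets

def householder(w):
    """Orthogonal reflection H = I - 2 w w^T / (w^T w)."""
    w = w / np.linalg.norm(w)
    return np.eye(w.size) - 2.0 * np.outer(w, w)

# Hypothetical learnable parameters (randomly initialised here).
H = householder(rng.standard_normal(D_in))        # orthogonal adapter
A_sample = 0.1 * rng.standard_normal((r, D_in))   # sample-level router weights
task_emb = 0.1 * rng.standard_normal((2, r))      # one routing offset per task

def adapted_forward(x, task_id):
    # One routing weight per expert, from sample- and task-level cues.
    gate = A_sample @ x + task_emb[task_id]
    # Adjust the singular values, rotate the input side, then mix the experts.
    return U @ ((sigma + gate) * (Vt @ (H @ x)))

x = rng.standard_normal(D_in)
y = adapted_forward(x, task_id=0)

# Projecting y onto span(U) = Range(W) should leave it unchanged.
residual = np.linalg.norm(y - U @ (U.T @ y))
# The transformed right singular vectors remain orthonormal.
V_rot = Vt @ H
print(residual < 1e-10, np.allclose(V_rot @ V_rot.T, np.eye(r)))
```

In this sketch the output is always a linear combination of the columns of `U`, so it cannot leave the original column space regardless of what the router produces; with a zero gate and `H` set to the identity, the forward pass reduces to `W @ x`.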

Paper Structure

This paper contains 22 sections, 5 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: (a) An illustration of our model MoE-ization strategy and the corresponding MoORE architecture. (b) A comparison of various multi-task adaptation methods for fine-tuning LLaMA-3.1 8B [grattafiori2024llama3herdmodels] on CSR-MTL, which is constructed from nine tasks [allenai:arc, clark2019boolq, OpenBookQA2018, Bisk2020, sap2019socialiqa, zellers2019hellaswag, sakaguchi2021winogrande, talmor-etal-2019-commonsenseqa]. MoORE consistently works better than the baselines when the number of tasks is larger than one. (c) Before adaptation, LLaMA-3.1 8B achieves encouraging overall performance (i.e., the gray dashed line) in seven tasks [hendryckstest2021, hendrycks2021ethics, zhou2023instructionfollowingevaluationlargelanguage, suzgun2022challenging, rein2024gpqa, chen2021evaluating, austin2021program, cobbe2021gsm8k]. MoORE mitigates the performance degradation and outperforms the most competitive baseline, LoRAMoE [dou2024loramoe]. (d) MoORE's runtime is comparable to that of its competitors. Compared to the original LLaMA-3.1 8B (i.e., the gray dashed line), MoORE increases the inference time moderately.
  • Figure 2: The loss of performance in OR-MTL. (a-g) The gray dashed line represents the performance of LLaMA-3.1 8B before adaptation. (h) The overall performance degradation across all tasks.
  • Figure 3: The visualization of normalized performance degradation and task correlation. The "difference" shown in the first row is the normalized performance degradation, i.e., $(\text{Acc}_{\text{Base}}-\text{Acc}_{\text{MoORE}})/100\%$. The following matrix records the normalized task correlation. The element in the $j$-th row and the $i$-th column is $\|\bm{g}_i-\bm{g}_j\|_2/\max_{k,k'}\|\bm{g}_k-\bm{g}_{k'}\|_2$.
  • Figure 4: The mean and variance of routing weights obtained in HellaS.
  • Figure 5: The mean and variance of routing weights obtained in four QA tasks.
  • ...and 1 more figures
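The normalized task-correlation matrix defined in the Figure 3 caption can be computed as in the sketch below. The caption does not say what the vectors $\bm{g}_i$ are or what dimension they have, so the random matrix `G` (one row per task, seven tasks, sixteen dimensions) is purely a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical placeholder: one vector g_i per task (7 tasks, 16 dims).
G = rng.standard_normal((7, 16))

# Pairwise Euclidean distances ||g_i - g_j||_2 via broadcasting.
diff = G[:, None, :] - G[None, :, :]
dist = np.linalg.norm(diff, axis=-1)

# Normalize by the largest pairwise distance so entries lie in [0, 1].
corr = dist / dist.max()

print(corr.shape)
```

The result is symmetric with a zero diagonal, and its largest entry is 1 by construction, so the row/column convention in the caption does not affect the values.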