AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs

Xuyang Wei, Chunlin Tian, Li Li

TL;DR

AsymLoRA targets instruction fine-tuning of Multimodal Large Language Models, addressing both modality-specific conflicts and cross-task commonalities. It introduces an asymmetric LoRA design with a shared low-rank matrix $A$ for universal knowledge and task-specific matrices $B_i$ for domain-specific adaptation, optionally extended with a Mixture-of-AsymLoRA-Experts that selects adaptation paths dynamically via gating. The approach yields consistent improvements over vanilla LoRA and LoRA-MoE across diverse single-domain and multi-domain benchmarks, while reducing trainable parameters and increasing efficiency. The framework offers a scalable approach to robust multimodal instruction tuning and cross-domain transfer, with practical impact for building versatile MLLMs, though it adds some inference overhead and leaves open questions about the theoretical grounding of the shared knowledge consolidation.

Abstract

Effective instruction fine-tuning on diverse image-text datasets is crucial for developing a versatile Multimodal Large Language Model (MLLM), where dataset composition dictates the model's adaptability across multimodal tasks. However, complex datasets often contain inherent conflicts -- stemming from modality-specific optimization objectives -- and latent commonalities that enable cross-task transfer, which most existing approaches handle separately. To bridge this gap, we introduce AsymLoRA, a parameter-efficient tuning framework that unifies knowledge modularization and cross-modal coordination via asymmetric LoRA: task-specific low-rank projections (matrix B) that preserve distinct adaptation pathways for conflicting objectives, and a shared projection (matrix A) that consolidates cross-modal commonalities. Extensive evaluations demonstrate that AsymLoRA consistently surpasses both vanilla LoRA, which captures only commonalities, and LoRA-MoE, which focuses solely on conflicts, achieving superior model performance and system efficiency across diverse benchmarks. Code: https://github.com/Clin0212/HydraLoRA/blob/main/MLLM-HydraLoRA/README.md
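
To make the parameter-efficiency claim concrete, the following is a rough counting sketch for a single adapted linear layer. It assumes the standard LoRA shapes (A of size r x d_in, B of size d_out x r), gives every expert the same rank, and ignores any router parameters; the layer sizes, rank, and expert count are illustrative values, not figures from the paper.

```python
def lora_params(d_in, d_out, r):
    """Vanilla LoRA: one A (r x d_in) and one B (d_out x r)."""
    return r * d_in + d_out * r


def lora_moe_params(d_in, d_out, r, n_experts):
    """LoRA-MoE: every expert owns its own (A, B) pair."""
    return n_experts * lora_params(d_in, d_out, r)


def asym_lora_params(d_in, d_out, r, n_experts):
    """Asymmetric LoRA: a single shared A plus one B per task/expert."""
    return r * d_in + n_experts * d_out * r


# Illustrative (assumed) sizes; router/gating weights are not counted.
d_in = d_out = 4096
r = 8
n_experts = 4

counts = {
    "vanilla LoRA": lora_params(d_in, d_out, r),
    "LoRA-MoE": lora_moe_params(d_in, d_out, r, n_experts),
    "asymmetric LoRA": asym_lora_params(d_in, d_out, r, n_experts),
}

for name, count in counts.items():
    print(f"{name:>16}: {count:,} trainable parameters")
```

Under these assumptions, sharing A removes (n_experts - 1) * r * d_in parameters relative to a LoRA-MoE layer with the same rank and number of experts. How the count compares with vanilla LoRA depends on the ranks chosen for each method.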

Paper Structure

This paper contains 15 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustration of LoRA architecture changes in AsymLoRA. (a) Vanilla LoRA applies a single shared adaptation across all tasks, relying solely on common synergies but introducing conflicts during fine-tuning. (b) Multiple task-specific LoRA modules mitigate interference by isolating tasks but focus only on differences, limiting generalization and increasing overhead. (c) AsymLoRA introduces an asymmetric structure with a shared A matrix for common knowledge and multiple B matrices for task-specific features, balancing generalization and efficiency.
  • Figure 2: Architecture and workflow of AsymLoRA. The shared low-rank matrix $A$ captures global knowledge across tasks, while task-specific low-rank matrices $B_i$ enable independent adaptation for each task.
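
The structure described for Figure 2 can be sketched as a forward pass through one adapted layer. This is a minimal pure-Python illustration, not the authors' implementation: the softmax router `Wg`, the zero initialization of each $B_i$, and all dimensions are assumptions made for the example.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(t - top) for t in z]
    total = sum(exps)
    return [e / total for e in exps]


def asym_lora_forward(x, W0, A, Bs, Wg):
    """y = W0 x + sum_i w_i * B_i (A x), with w = softmax(Wg x).

    W0 is the frozen base weight, A the shared down-projection,
    Bs the per-task up-projections, Wg a hypothetical router.
    """
    base = matvec(W0, x)
    shared = matvec(A, x)              # shared projection, computed once
    weights = softmax(matvec(Wg, x))   # one gate weight per B_i
    out = list(base)
    for w, B in zip(weights, Bs):
        delta = matvec(B, shared)      # task-specific up-projection
        out = [o + w * d for o, d in zip(out, delta)]
    return out, weights


# Tiny illustrative (assumed) dimensions.
d_in, d_out, rank, n_tasks = 4, 3, 2, 3
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W0 = rand_matrix(d_out, d_in)
A = rand_matrix(rank, d_in)
Bs = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_tasks)]  # zero init (assumed)
Wg = rand_matrix(n_tasks, d_in)

x = [1.0, -0.5, 0.25, 2.0]
y, gate = asym_lora_forward(x, W0, A, Bs, Wg)
print("gate weights:", [round(g, 3) for g in gate])
print("output:", [round(v, 3) for v in y])
```

With every $B_i$ at zero, the adapted layer reproduces the frozen base output, which is the usual LoRA starting point. The projection `A x` is computed once and reused by every $B_i$, which is where the asymmetric design saves work relative to giving each expert its own A.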