Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities

Raman Dutt, Harleen Hanspal, Guoxuan Xia, Petru-Daniel Tudosiu, Alexander Black, Yongxin Yang, Steven McDonagh, Sarah Parisot

TL;DR

This work addresses extending uni-modal large language models to multimodal generation without sacrificing text capabilities or incurring prohibitive parameter costs. It leverages latent MoE redundancy by converting a dense LLM to a Mixture-of-Experts, applying Partial LoRA only to image tokens, and using a Gromov-Wasserstein distance-based initialization to align image and text embeddings. The approach yields modality-specific routing and reduced expert redundancy, enabling competitive image generation with only 7.5M training samples and low compute, while preserving near-original language performance. This parameter-efficient multimodal pathway offers a scalable route to integrating additional modalities with minimal performance loss and computational overhead.

Abstract

In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: (C1) preserving the original language generative capabilities with negligible performance degradation, and (C2) adhering to a small parameter budget to learn the new modality, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules, thereby significantly increasing the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C2). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C1). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways and decreased redundancy within the experts that can efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.
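The idea of applying low-rank adaptation only to tokens of the new modality can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name plora_linear, the tensor shapes, the scale parameter, and the random initialization of B are all assumptions chosen for illustration (LoRA commonly initializes B to zero, which would make the update invisible in a demo).

```python
import numpy as np

def plora_linear(x, image_mask, W, A, B, scale=1.0):
    """Partial low-rank adaptation of one linear layer (illustrative sketch).

    x:          (seq, d_in) token activations
    image_mask: (seq,) boolean, True where the token is an image token
    W:          (d_in, d_out) frozen pre-trained weight
    A, B:       (d_in, r) and (r, d_out) trainable low-rank factors
    """
    base = x @ W                       # frozen path, applied to every token
    delta = ((x @ A) @ B) * scale      # low-rank update, computed for every token
    # Mask the update so that only image tokens receive it;
    # text tokens pass through the frozen weight unchanged.
    return base + image_mask[:, None] * delta

rng = np.random.default_rng(0)
seq, d_in, d_out, r = 6, 8, 8, 2       # hypothetical sizes
x = rng.normal(size=(seq, d_in))
W = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, r))
B = rng.normal(size=(r, d_out))        # non-zero only so the effect is visible
image_mask = np.array([False, False, True, True, True, False])

y = plora_linear(x, image_mask, W, A, B)
frozen = x @ W
text_unchanged = np.allclose(y[~image_mask], frozen[~image_mask])
image_changed = not np.allclose(y[image_mask], frozen[image_mask])
```

Because the update is masked rather than merged into W, the outputs for text tokens are identical to those of the original layer, which is the mechanism the abstract credits for preserving language performance.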

Paper Structure

This paper contains 26 sections, 4 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Overall schematic of the proposed framework. (1) Dense pre-trained LLM is converted to its MoE variant. (2) Each expert in LLM-MoE is still a text-expert due to text pre-training. (3) The MHA block in the LLM is then modified with the PLoRA module and fine-tuned on multi-modal data. During fine-tuning, the routers learn to assign dedicated experts to image and text modalities. (4) We illustrate the PLoRA module, which applies low-rank adaptation exclusively to the image tokens in an input sequence containing both image (yellow) and text tokens (blue).
  • Figure 2: Example generated samples using our approach. The images exhibit high fidelity and maintain strong textual coherence. See the appendix for examples showing strong textual coherence, more generated samples, and the associated prompts.
  • Figure 2: Fréchet Inception Distance (FID) and Inception Score (IS) for LLaMA-MoE fine-tuned using LoRA and PLoRA on the MSCOCO, CUB and Oxford datasets.
  • Figure 3: Comparison of training loss convergence behaviour across different parameter initialization schemes.
  • Figure 4: Average expert redundancy (co-activation) across each layer before and after multi-modal fine-tuning. The observed reduction in average expert redundancy after fine-tuning indicates that redundant experts were leveraged to learn the new modality.
  • ...and 9 more figures
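
The routing behaviour described in the Figure 1 caption, where routers learn to assign experts to image and text tokens, can be pictured with a generic top-k Mixture-of-Experts layer. The sketch below is a standard textbook-style router and not the paper's code: moe_layer, the use of single linear maps as experts (real experts are feed-forward blocks), the softmax over the selected logits, and all sizes are assumptions for illustration. With random weights it shows only how per-modality expert usage could be counted, not the learned specialization itself.

```python
import numpy as np

def moe_layer(x, gate_W, experts, k=2):
    """Generic top-k MoE layer (illustrative sketch).

    x:       (seq, d) token activations
    gate_W:  (d, n_experts) router weights
    experts: list of (d, d) matrices standing in for expert networks
    """
    logits = x @ gate_W                                   # (seq, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]            # indices of the k largest logits
    top_logits = np.take_along_axis(logits, topk, axis=-1)
    # Softmax restricted to the selected experts (numerically stabilized).
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(k):
            e = topk[t, slot]
            out[t] += weights[t, slot] * (x[t] @ experts[e])
    return out, topk

rng = np.random.default_rng(0)
seq, d, n_experts, k = 6, 8, 4, 2                         # hypothetical sizes
x = rng.normal(size=(seq, d))
gate_W = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
image_mask = np.array([False, False, True, True, True, False])

out, topk = moe_layer(x, gate_W, experts, k=k)
# How often each expert is selected, split by modality.
image_counts = np.bincount(topk[image_mask].ravel(), minlength=n_experts)
text_counts = np.bincount(topk[~image_mask].ravel(), minlength=n_experts)
```

Comparing image_counts and text_counts per layer is one simple way to look for modality-specific pathways of the kind the captions describe.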