Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities

Raman Dutt; Harleen Hanspal; Guoxuan Xia; Petru-Daniel Tudosiu; Alexander Black; Yongxin Yang; Steven McDonagh; Sarah Parisot

TL;DR

This work addresses extending uni-modal large language models to multimodal generation without sacrificing text capabilities or incurring prohibitive parameter costs. It leverages latent MoE redundancy by converting a dense LLM to a Mixture-of-Experts, applying Partial LoRA only to image tokens, and using a Gromov-Wasserstein distance-based initialization to align image and text embeddings. The approach yields modality-specific routing and reduced expert redundancy, enabling competitive image generation with only 7.5M training samples and low compute, while preserving near-original language performance. This parameter-efficient multimodal pathway offers a scalable route to integrating additional modalities with minimal performance loss and computational overhead.

Abstract

In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: (C1) preserving the original language generative capabilities with negligible performance degradation, and (C2) adhering to a small parameter budget to learn the new modality, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules, thereby significantly increasing the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C2). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C1). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways and decreased redundancy within the experts, which together efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.
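The idea of applying low-rank adaptation exclusively to the tokens of the new modality can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `partial_lora_linear`, the boolean mask `is_image`, the `scale` factor, and all shapes are assumptions chosen for illustration. The sketch only shows that a masked low-rank update leaves the frozen layer's output on text tokens unchanged.

```python
import numpy as np


def partial_lora_linear(x, is_image, W, A, B, scale=1.0):
    """Frozen linear layer plus a low-rank update applied only to image tokens.

    Hypothetical helper for illustration only.
    x:        (num_tokens, d_in) token activations
    is_image: (num_tokens,) boolean mask, True for new-modality tokens
    W:        (d_in, d_out) frozen pre-trained weight
    A:        (d_in, r) trainable low-rank down-projection
    B:        (r, d_out) trainable low-rank up-projection
    """
    base = x @ W                      # original path, identical for every token
    delta = scale * (x @ A) @ B       # low-rank correction of rank at most r
    # The mask zeroes the correction on text tokens, so their output is untouched.
    return base + is_image[:, None] * delta


# Toy example: five tokens, of which the third and fourth are image tokens.
rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
x = rng.normal(size=(5, d_in))
is_image = np.array([False, False, True, True, False])
W = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, r))
B = rng.normal(size=(r, d_out))

y = partial_lora_linear(x, is_image, W, A, B)
```

In this sketch the text rows of `y` equal `x @ W` exactly, which is the mechanism behind constraint C1, while only `A` and `B` would be trained, which keeps the added parameter count small in the spirit of C2.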
